linux-mm.kvack.org archive mirror
* [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem
@ 2023-04-04 12:01 David Stevens
  2023-04-04 12:01 ` [PATCH v6 1/4] mm/khugepaged: drain lru after swapping in shmem David Stevens
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: David Stevens @ 2023-04-04 12:01 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

This series reworks collapse_file so that the intermediate state of the
collapse does not leak out of collapse_file. Although this makes
collapse_file a bit more complicated, it means that the rest of the
kernel doesn't have to deal with the unusual state. This directly fixes
races with both lseek and mincore.
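
To make the lseek race concrete, here is an illustrative userspace
sketch (not part of the series; fd and off are placeholders): while a
collapse was in flight, polling SEEK_DATA could transiently report a
hole at an offset known to hold data.

  #define _GNU_SOURCE	/* SEEK_DATA */
  #include <unistd.h>

  static int offset_has_data(int fd, off_t off)
  {
  	/* Before this series, this could spuriously return 0 (hole). */
  	return lseek(fd, off, SEEK_DATA) == off;
  }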

This series also fixes the fact that khugepaged completely breaks
userfaultfd+shmem. The rework of collapse_file provides a convenient
place to check for registered userfaultfds without making the shmem
userfaultfd implementation care about khugepaged.

Finally, this series adds a lru_add_drain after swapping in shmem pages,
which makes the subsequent folio_isolate_lru significantly more likely
to succeed.
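
The mechanism, as a rough sketch (assumed context; the actual hunk is
in patch 1): a folio that was just added to the LRU may still sit in a
per-CPU folio batch, so its LRU flag is not yet set and isolation
fails until the batches are drained.

  /* After swapping in a shmem folio inside collapse_file()... */
  lru_add_drain();		/* flush this CPU's folio batches onto the LRU */
  if (!folio_isolate_lru(folio))	/* needs the LRU flag set */
  	result = SCAN_DEL_PAGE_LRU;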

v5 -> v6:
 - Stop freezing the old pages so that we don't deadlock with
   mc_handle_file_pte and mincore.
 - Add missing locking around shmem charge rollback.
 - Rebase on mm-unstable (f01f73d64cb5). Beyond straightforward
   conflicts, this involves adapting the fix for f520a742287e (i.e. an
   unhandled ENOMEM).
 - Fix bug with bounds used with vma_interval_tree_foreach.
 - Add a patch doing lru_add_drain after swapping in the shmem case.
 - Update/clarify some comments.
 - Drop ack on final patch
v4 -> v5:
 - Rebase on mm-unstable (9caa15b8a499)
 - Gather acks
v3 -> v4:
 - Base changes on mm-everything (fba720cb4dc0)
 - Add patch to refactor error handling control flow in collapse_file
 - Rebase userfaultfd patch with no significant logic changes
 - Different approach for fixing lseek race
v2 -> v3:
 - Use XA_RETRY_ENTRY to synchronize with reads from the page cache
   under the RCU read lock in userfaultfd fix
 - Add patch to fix lseek race
v1 -> v2:
 - Different approach for userfaultfd fix

David Stevens (4):
  mm/khugepaged: drain lru after swapping in shmem
  mm/khugepaged: refactor collapse_file control flow
  mm/khugepaged: skip shmem with userfaultfd
  mm/khugepaged: maintain page cache uptodate flag

 include/trace/events/huge_memory.h |   3 +-
 mm/khugepaged.c                    | 312 ++++++++++++++++-------------
 2 files changed, 171 insertions(+), 144 deletions(-)

-- 
2.40.0.348.gf938b09366-goog




* [PATCH v6 1/4] mm/khugepaged: drain lru after swapping in shmem
  2023-04-04 12:01 [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem David Stevens
@ 2023-04-04 12:01 ` David Stevens
  2023-04-04 12:01 ` [PATCH v6 2/4] mm/khugepaged: refactor collapse_file control flow David Stevens
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: David Stevens @ 2023-04-04 12:01 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Call lru_add_drain after swapping in shmem pages so that
isolate_lru_page is more likely to succeed.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 mm/khugepaged.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 666d2c4e38dd..90577247cfaf 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1963,6 +1963,8 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 					result = SCAN_FAIL;
 					goto xa_unlocked;
 				}
+				/* drain pagevecs to help isolate_lru_page() */
+				lru_add_drain();
 				page = folio_file_page(folio, index);
 			} else if (trylock_page(page)) {
 				get_page(page);
-- 
2.40.0.348.gf938b09366-goog




* [PATCH v6 2/4] mm/khugepaged: refactor collapse_file control flow
  2023-04-04 12:01 [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem David Stevens
  2023-04-04 12:01 ` [PATCH v6 1/4] mm/khugepaged: drain lru after swapping in shmem David Stevens
@ 2023-04-04 12:01 ` David Stevens
  2023-04-04 12:01 ` [PATCH v6 3/4] mm/khugepaged: skip shmem with userfaultfd David Stevens
  2023-04-04 12:01 ` [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag David Stevens
  3 siblings, 0 replies; 10+ messages in thread
From: David Stevens @ 2023-04-04 12:01 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Add a rollback label to deal with failure, instead of repeatedly
checking for SCAN_SUCCEED, to make it easier to add more failure
cases. The refactoring also allows the collapse_file tracepoint to
include hpage on success (instead of NULL).

Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 230 ++++++++++++++++++++++++------------------------
 1 file changed, 113 insertions(+), 117 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90577247cfaf..90828272a065 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1890,6 +1890,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	__SetPageLocked(hpage);
+	if (is_shmem)
+		__SetPageSwapBacked(hpage);
+	hpage->index = start;
+	hpage->mapping = mapping;
+
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
 	 * almost certainly a no-op because most of the pages must be present
@@ -1902,16 +1908,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		xas_unlock_irq(&xas);
 		if (!xas_nomem(&xas, GFP_KERNEL)) {
 			result = SCAN_FAIL;
-			goto out;
+			goto rollback;
 		}
 	} while (1);
 
-	__SetPageLocked(hpage);
-	if (is_shmem)
-		__SetPageSwapBacked(hpage);
-	hpage->index = start;
-	hpage->mapping = mapping;
-
 	/*
 	 * At this point the hpage is locked and not up-to-date.
 	 * It's safe to insert it into the page cache, because nobody would
@@ -2137,137 +2137,133 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	 */
 	try_to_unmap_flush();
 
-	if (result == SCAN_SUCCEED) {
-		/*
-		 * Replacing old pages with new one has succeeded, now we
-		 * attempt to copy the contents.
-		 */
-		index = start;
-		list_for_each_entry(page, &pagelist, lru) {
-			while (index < page->index) {
-				clear_highpage(hpage + (index % HPAGE_PMD_NR));
-				index++;
-			}
-			if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR),
-					     page) > 0) {
-				result = SCAN_COPY_MC;
-				break;
-			}
-			index++;
-		}
-		while (result == SCAN_SUCCEED && index < end) {
+	if (result != SCAN_SUCCEED)
+		goto rollback;
+
+	/*
+	 * Replacing old pages with new one has succeeded, now we
+	 * attempt to copy the contents.
+	 */
+	index = start;
+	list_for_each_entry(page, &pagelist, lru) {
+		while (index < page->index) {
 			clear_highpage(hpage + (index % HPAGE_PMD_NR));
 			index++;
 		}
+		if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page) > 0) {
+			result = SCAN_COPY_MC;
+			goto rollback;
+		}
+		index++;
+	}
+	while (index < end) {
+		clear_highpage(hpage + (index % HPAGE_PMD_NR));
+		index++;
+	}
+
+	/*
+	 * Copying old pages to huge one has succeeded, now we
+	 * need to free the old pages.
+	 */
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+		list_del(&page->lru);
+		page->mapping = NULL;
+		page_ref_unfreeze(page, 1);
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		unlock_page(page);
+		put_page(page);
 	}
 
 	nr = thp_nr_pages(hpage);
-	if (result == SCAN_SUCCEED) {
-		/*
-		 * Copying old pages to huge one has succeeded, now we
-		 * need to free the old pages.
-		 */
-		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
-			list_del(&page->lru);
-			page->mapping = NULL;
-			page_ref_unfreeze(page, 1);
-			ClearPageActive(page);
-			ClearPageUnevictable(page);
-			unlock_page(page);
-			put_page(page);
-		}
+	xas_lock_irq(&xas);
+	if (is_shmem)
+		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
+	else
+		__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
 
-		xas_lock_irq(&xas);
-		if (is_shmem)
-			__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
-		else
-			__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
+	if (nr_none) {
+		__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
+		/* nr_none is always 0 for non-shmem. */
+		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
+	}
+	/* Join all the small entries into a single multi-index entry. */
+	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+	xas_store(&xas, hpage);
+	xas_unlock_irq(&xas);
 
-		if (nr_none) {
-			__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
-			/* nr_none is always 0 for non-shmem. */
-			__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
-		}
-		/* Join all the small entries into a single multi-index entry. */
-		xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-		xas_store(&xas, hpage);
-		xas_unlock_irq(&xas);
+	folio = page_folio(hpage);
+	folio_mark_uptodate(folio);
+	folio_ref_add(folio, HPAGE_PMD_NR - 1);
 
-		folio = page_folio(hpage);
-		folio_mark_uptodate(folio);
-		folio_ref_add(folio, HPAGE_PMD_NR - 1);
+	if (is_shmem)
+		folio_mark_dirty(folio);
+	folio_add_lru(folio);
 
-		if (is_shmem)
-			folio_mark_dirty(folio);
-		folio_add_lru(folio);
+	/*
+	 * Remove pte page tables, so we can re-fault the page as huge.
+	 */
+	result = retract_page_tables(mapping, start, mm, addr, hpage,
+				     cc);
+	unlock_page(hpage);
+	goto out;
+
+rollback:
+	/* Something went wrong: roll back page cache changes */
+	xas_lock_irq(&xas);
+	if (nr_none) {
+		mapping->nrpages -= nr_none;
+		shmem_uncharge(mapping->host, nr_none);
+	}
 
-		/*
-		 * Remove pte page tables, so we can re-fault the page as huge.
-		 */
-		result = retract_page_tables(mapping, start, mm, addr, hpage,
-					     cc);
-		unlock_page(hpage);
-		hpage = NULL;
-	} else {
-		/* Something went wrong: roll back page cache changes */
-		xas_lock_irq(&xas);
-		if (nr_none) {
-			mapping->nrpages -= nr_none;
-			shmem_uncharge(mapping->host, nr_none);
+	xas_set(&xas, start);
+	xas_for_each(&xas, page, end - 1) {
+		page = list_first_entry_or_null(&pagelist,
+				struct page, lru);
+		if (!page || xas.xa_index < page->index) {
+			if (!nr_none)
+				break;
+			nr_none--;
+			/* Put holes back where they were */
+			xas_store(&xas, NULL);
+			continue;
 		}
 
-		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
-			page = list_first_entry_or_null(&pagelist,
-					struct page, lru);
-			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
-				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
-				continue;
-			}
+		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 
-			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
-
-			/* Unfreeze the page. */
-			list_del(&page->lru);
-			page_ref_unfreeze(page, 2);
-			xas_store(&xas, page);
-			xas_pause(&xas);
-			xas_unlock_irq(&xas);
-			unlock_page(page);
-			putback_lru_page(page);
-			xas_lock_irq(&xas);
-		}
-		VM_BUG_ON(nr_none);
+		/* Unfreeze the page. */
+		list_del(&page->lru);
+		page_ref_unfreeze(page, 2);
+		xas_store(&xas, page);
+		xas_pause(&xas);
+		xas_unlock_irq(&xas);
+		unlock_page(page);
+		putback_lru_page(page);
+		xas_lock_irq(&xas);
+	}
+	VM_BUG_ON(nr_none);
+	/*
+	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
+	 * file only. This undo is not needed unless failure is
+	 * due to SCAN_COPY_MC.
+	 */
+	if (!is_shmem && result == SCAN_COPY_MC) {
+		filemap_nr_thps_dec(mapping);
 		/*
-		 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
-		 * file only. This undo is not needed unless failure is
-		 * due to SCAN_COPY_MC.
+		 * Paired with smp_mb() in do_dentry_open() to
+		 * ensure the update to nr_thps is visible.
 		 */
-		if (!is_shmem && result == SCAN_COPY_MC) {
-			filemap_nr_thps_dec(mapping);
-			/*
-			 * Paired with smp_mb() in do_dentry_open() to
-			 * ensure the update to nr_thps is visible.
-			 */
-			smp_mb();
-		}
+		smp_mb();
+	}
 
-		xas_unlock_irq(&xas);
+	xas_unlock_irq(&xas);
 
-		hpage->mapping = NULL;
-	}
+	hpage->mapping = NULL;
 
-	if (hpage)
-		unlock_page(hpage);
+	unlock_page(hpage);
+	put_page(hpage);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (hpage)
-		put_page(hpage);
-
 	trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
 	return result;
 }
-- 
2.40.0.348.gf938b09366-goog




* [PATCH v6 3/4] mm/khugepaged: skip shmem with userfaultfd
  2023-04-04 12:01 [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem David Stevens
  2023-04-04 12:01 ` [PATCH v6 1/4] mm/khugepaged: drain lru after swapping in shmem David Stevens
  2023-04-04 12:01 ` [PATCH v6 2/4] mm/khugepaged: refactor collapse_file control flow David Stevens
@ 2023-04-04 12:01 ` David Stevens
  2023-04-04 12:01 ` [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag David Stevens
  3 siblings, 0 replies; 10+ messages in thread
From: David Stevens @ 2023-04-04 12:01 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then
for any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
collapsing a shmem range that would result in replacing an empty page
with a THP, so as not to break userfaultfd.
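
For reference, an illustrative sketch (not part of this patch; addr
and len are placeholders, error handling is minimal) of the userspace
setup being protected, i.e. registering a MODE_MISSING userfaultfd
over a shmem mapping:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/userfaultfd.h>

  static int register_missing_uffd(void *addr, size_t len)
  {
  	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
  	struct uffdio_api api = { .api = UFFD_API };
  	struct uffdio_register reg = {
  		.range = { .start = (unsigned long)addr, .len = len },
  		.mode = UFFDIO_REGISTER_MODE_MISSING,
  	};

  	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
  	    ioctl(uffd, UFFDIO_REGISTER, &reg))
  		return -1;
  	/*
  	 * Userspace now expects a fault event for every page it knows
  	 * to be missing in [addr, addr + len); collapse_file must not
  	 * silently turn such missing pages into present ones.
  	 */
  	return uffd;
  }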

Synchronization when checking for userfaultfds in collapse_file is
tricky because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race
occurs. This makes such a race indistinguishable from a collapse
occurring immediately before the userfaultfd registration.

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache
lock can be dropped during that loop. The second step is to fill the
gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
time, to avoid races with accesses to the page cache that only take the
RCU read lock.

The fact that we don't fill holes during the initial iteration means
that collapse_file now has to handle faults occurring during the
collapse. This is done by re-validating the number of missing pages
after acquiring the page cache lock for the final time.

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause
the syscall to fail with EAGAIN.
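
A caller might treat these cases roughly as follows (illustrative
sketch; the retry bound is arbitrary, and MADV_COLLAPSE may need a
fallback define where libc headers predate it):

  #include <errno.h>
  #include <sys/mman.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25	/* value from <linux/mman.h> */
  #endif

  static int try_collapse(void *addr, size_t len)
  {
  	for (int tries = 0; tries < 3; tries++) {
  		if (madvise(addr, len, MADV_COLLAPSE) == 0)
  			return 0;	/* backed by a THP now */
  		if (errno != EAGAIN)	/* e.g. EBUSY: missing pages + uffd */
  			return -1;
  		/* EAGAIN: a fault raced with the collapse; try again */
  	}
  	return -1;
  }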

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_HOLE), but it
is exceedingly unlikely that anything relies on being able to observe
that transient state.

Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |   3 +-
 mm/khugepaged.c                    | 109 +++++++++++++++++++++--------
 2 files changed, 81 insertions(+), 31 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index eca4c6f3625e..877cbf9fd2ec 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -38,7 +38,8 @@
 	EM( SCAN_TRUNCATED,		"truncated")			\
 	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
 	EM( SCAN_STORE_FAILED,		"store_failed")			\
-	EMe(SCAN_COPY_MC,		"copy_poisoned_page")
+	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
+	EMe(SCAN_PAGE_FILLED,		"page_filled")
 
 #undef EM
 #undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90828272a065..7679551e9540 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -57,6 +57,7 @@ enum scan_result {
 	SCAN_PAGE_HAS_PRIVATE,
 	SCAN_COPY_MC,
 	SCAN_STORE_FAILED,
+	SCAN_PAGE_FILLED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1856,8 +1857,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
+ *  - finalize updates to the page cache;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
@@ -1935,22 +1936,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index);
+					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
-				if (xas_error(&xas)) {
-					/* revert shmem_charge performed
-					 * in the previous condition
-					 */
-					mapping->nrpages--;
-					shmem_uncharge(mapping->host, 1);
-					result = SCAN_STORE_FAILED;
-					goto xa_locked;
-				}
 				nr_none++;
 				continue;
 			}
@@ -2161,22 +2152,66 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		index++;
 	}
 
-	/*
-	 * Copying old pages to huge one has succeeded, now we
-	 * need to free the old pages.
-	 */
-	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
-		list_del(&page->lru);
-		page->mapping = NULL;
-		page_ref_unfreeze(page, 1);
-		ClearPageActive(page);
-		ClearPageUnevictable(page);
-		unlock_page(page);
-		put_page(page);
+	if (nr_none) {
+		struct vm_area_struct *vma;
+		int nr_none_check = 0;
+
+		i_mmap_lock_read(mapping);
+		xas_lock_irq(&xas);
+
+		xas_set(&xas, start);
+		for (index = start; index < end; index++) {
+			if (!xas_next(&xas)) {
+				xas_store(&xas, XA_RETRY_ENTRY);
+				if (xas_error(&xas)) {
+					result = SCAN_STORE_FAILED;
+					goto immap_locked;
+				}
+				nr_none_check++;
+			}
+		}
+
+		if (nr_none != nr_none_check) {
+			result = SCAN_PAGE_FILLED;
+			goto immap_locked;
+		}
+
+		/*
+		 * If userspace observed a missing page in a VMA with a MODE_MISSING
+		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for that
+		 * page. If so, we need to roll back to avoid suppressing such an
+		 * event. Wp/minor userfaultfds don't give userspace any
+		 * guarantees that the kernel doesn't fill a missing page with a
+		 * zero page, so they don't matter here.
+		 *
+		 * Any userfaultfds registered after this point will not be able to
+		 * observe any missing pages due to the previously inserted retry
+		 * entries.
+		 */
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, end) {
+			if (userfaultfd_missing(vma)) {
+				result = SCAN_EXCEED_NONE_PTE;
+				goto immap_locked;
+			}
+		}
+
+immap_locked:
+		i_mmap_unlock_read(mapping);
+		if (result != SCAN_SUCCEED) {
+			xas_set(&xas, start);
+			for (index = start; index < end; index++) {
+				if (xas_next(&xas) == XA_RETRY_ENTRY)
+					xas_store(&xas, NULL);
+			}
+
+			xas_unlock_irq(&xas);
+			goto rollback;
+		}
+	} else {
+		xas_lock_irq(&xas);
 	}
 
 	nr = thp_nr_pages(hpage);
-	xas_lock_irq(&xas);
 	if (is_shmem)
 		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
 	else
@@ -2206,6 +2241,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	result = retract_page_tables(mapping, start, mm, addr, hpage,
 				     cc);
 	unlock_page(hpage);
+
+	/*
+	 * The collapse has succeeded, so free the old pages.
+	 */
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+		list_del(&page->lru);
+		page->mapping = NULL;
+		page_ref_unfreeze(page, 1);
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		unlock_page(page);
+		put_page(page);
+	}
+
 	goto out;
 
 rollback:
@@ -2217,15 +2266,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	}
 
 	xas_set(&xas, start);
-	xas_for_each(&xas, page, end - 1) {
+	end = index;
+	for (index = start; index < end; index++) {
+		xas_next(&xas);
 		page = list_first_entry_or_null(&pagelist,
 				struct page, lru);
 		if (!page || xas.xa_index < page->index) {
-			if (!nr_none)
-				break;
 			nr_none--;
-			/* Put holes back where they were */
-			xas_store(&xas, NULL);
 			continue;
 		}
 
@@ -2749,12 +2796,14 @@ static int madvise_collapse_errno(enum scan_result r)
 	case SCAN_ALLOC_HUGE_PAGE_FAIL:
 		return -ENOMEM;
 	case SCAN_CGROUP_CHARGE_FAIL:
+	case SCAN_EXCEED_NONE_PTE:
 		return -EBUSY;
 	/* Resource temporary unavailable - trying again might succeed */
 	case SCAN_PAGE_COUNT:
 	case SCAN_PAGE_LOCK:
 	case SCAN_PAGE_LRU:
 	case SCAN_DEL_PAGE_LRU:
+	case SCAN_PAGE_FILLED:
 		return -EAGAIN;
 	/*
 	 * Other: Trying again likely not to succeed / error intrinsic to
-- 
2.40.0.348.gf938b09366-goog




* [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-04-04 12:01 [PATCH v6 0/4] mm/khugepaged: fixes for khugepaged+shmem David Stevens
                   ` (2 preceding siblings ...)
  2023-04-04 12:01 ` [PATCH v6 3/4] mm/khugepaged: skip shmem with userfaultfd David Stevens
@ 2023-04-04 12:01 ` David Stevens
  2023-04-04 21:21   ` Peter Xu
  2023-06-20 20:55   ` Andres Freund
  3 siblings, 2 replies; 10+ messages in thread
From: David Stevens @ 2023-04-04 12:01 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Hugh Dickins
  Cc: Andrew Morton, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Make sure that collapse_file doesn't interfere with checking the
uptodate flag in the page cache by only inserting hpage into the page
cache after it has been updated and marked uptodate. This is achieved by
simply not replacing present pages with hpage when iterating over the
target range.

The present pages are already locked, so replacing them with the locked
hpage before the collapse is finalized is unnecessary. However, it is
necessary to stop freezing the present pages after validating them,
since leaving long-term frozen pages in the page cache can lead to
deadlocks. Simply checking the reference count is sufficient to ensure
that there are no long-term references hanging around that the
collapse would break. As with hpage, there is no reason that the
present pages actually need to be frozen in addition to being locked.

This fixes a race where folio_seek_hole_data would mistake hpage for
a fallocated but unwritten page. This race is visible to userspace via
data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
a similar race where pages could temporarily disappear from mincore.

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: David Stevens <stevensd@chromium.org>
---
 mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 50 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 7679551e9540..a19aa140fd52 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *
  * Basic scheme is simple, details are more complex:
  *  - allocate and lock a new huge page;
- *  - scan page cache replacing old pages with the new one
+ *  - scan page cache, locking old pages
  *    + swap/gup in pages if necessary;
- *    + keep old pages around in case rollback is required;
+ *  - copy data to new page
+ *  - handle shmem holes
+ *    + re-validate that holes weren't filled by someone else
+ *    + check for userfaultfd
  *  - finalize updates to the page cache;
  *  - if replacing succeeds:
- *    + copy data over;
- *    + free old pages;
  *    + unlock huge page;
+ *    + free old pages;
  *  - if replacing failed;
- *    + put all pages back and unfreeze them;
- *    + restore gaps in the page cache;
+ *    + unlock old pages
  *    + unlock and free huge page;
  */
 static int collapse_file(struct mm_struct *mm, unsigned long addr,
@@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 	} while (1);
 
-	/*
-	 * At this point the hpage is locked and not up-to-date.
-	 * It's safe to insert it into the page cache, because nobody would
-	 * be able to map it or use it in another way until we unlock it.
-	 */
-
 	xas_set(&xas, start);
 	for (index = start; index < end; index++) {
 		page = xas_next(&xas);
@@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
 
 		/*
-		 * The page is expected to have page_count() == 3:
+		 * We control three references to the page:
 		 *  - we hold a pin on it;
 		 *  - one reference from page cache;
 		 *  - one from isolate_lru_page;
+		 * If those are the only references, then any new usage of the
+		 * page will have to fetch it from the page cache. That requires
+		 * locking the page to handle truncate, so any new usage will be
+		 * blocked until we unlock the page after collapse/during rollback.
 		 */
-		if (!page_ref_freeze(page, 3)) {
+		if (page_count(page) != 3) {
 			result = SCAN_PAGE_COUNT;
 			xas_unlock_irq(&xas);
 			putback_lru_page(page);
@@ -2089,13 +2088,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		/*
-		 * Add the page to the list to be able to undo the collapse if
-		 * something go wrong.
+		 * Accumulate the pages that are being collapsed.
 		 */
 		list_add_tail(&page->lru, &pagelist);
-
-		/* Finally, replace with the new page. */
-		xas_store(&xas, hpage);
 		continue;
 out_unlock:
 		unlock_page(page);
@@ -2132,8 +2127,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		goto rollback;
 
 	/*
-	 * Replacing old pages with new one has succeeded, now we
-	 * attempt to copy the contents.
+	 * The old pages are locked, so they won't change anymore.
 	 */
 	index = start;
 	list_for_each_entry(page, &pagelist, lru) {
@@ -2222,11 +2216,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		/* nr_none is always 0 for non-shmem. */
 		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
 	}
-	/* Join all the small entries into a single multi-index entry. */
-	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-	xas_store(&xas, hpage);
-	xas_unlock_irq(&xas);
 
+	/*
+	 * Mark hpage as uptodate before inserting it into the page cache so
+	 * that it isn't mistaken for a fallocated but unwritten page.
+	 */
 	folio = page_folio(hpage);
 	folio_mark_uptodate(folio);
 	folio_ref_add(folio, HPAGE_PMD_NR - 1);
@@ -2235,6 +2229,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		folio_mark_dirty(folio);
 	folio_add_lru(folio);
 
+	/* Join all the small entries into a single multi-index entry. */
+	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+	xas_store(&xas, hpage);
+	xas_unlock_irq(&xas);
+
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
 	 */
@@ -2248,47 +2247,29 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
 		list_del(&page->lru);
 		page->mapping = NULL;
-		page_ref_unfreeze(page, 1);
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
 		unlock_page(page);
-		put_page(page);
+		folio_put_refs(page_folio(page), 3);
 	}
 
 	goto out;
 
 rollback:
 	/* Something went wrong: roll back page cache changes */
-	xas_lock_irq(&xas);
 	if (nr_none) {
+		xas_lock_irq(&xas);
 		mapping->nrpages -= nr_none;
 		shmem_uncharge(mapping->host, nr_none);
+		xas_unlock_irq(&xas);
 	}
 
-	xas_set(&xas, start);
-	end = index;
-	for (index = start; index < end; index++) {
-		xas_next(&xas);
-		page = list_first_entry_or_null(&pagelist,
-				struct page, lru);
-		if (!page || xas.xa_index < page->index) {
-			nr_none--;
-			continue;
-		}
-
-		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
-
-		/* Unfreeze the page. */
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
 		list_del(&page->lru);
-		page_ref_unfreeze(page, 2);
-		xas_store(&xas, page);
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
 		unlock_page(page);
 		putback_lru_page(page);
-		xas_lock_irq(&xas);
+		put_page(page);
 	}
-	VM_BUG_ON(nr_none);
 	/*
 	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM
 	 * file only. This undo is not needed unless failure is
@@ -2303,8 +2284,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		smp_mb();
 	}
 
-	xas_unlock_irq(&xas);
-
 	hpage->mapping = NULL;
 
 	unlock_page(hpage);
-- 
2.40.0.348.gf938b09366-goog




* Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-04-04 12:01 ` [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag David Stevens
@ 2023-04-04 21:21   ` Peter Xu
  2023-04-19  4:37     ` Hugh Dickins
  2023-06-20 20:55   ` Andres Freund
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-04-04 21:21 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Hugh Dickins, Andrew Morton, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Jiaqi Yan,
	linux-kernel

On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range.
> 
> The present pages are already locked, so replacing them with the locked
> hpage before the collapse is finalized is unnecessary. However, it is
> necessary to stop freezing the present pages after validating them,
> since leaving long-term frozen pages in the page cache can lead to
> deadlocks. Simply checking the reference count is sufficient to ensure
> that there are no long-term references hanging around that the
> collapse would break. As with hpage, there is no reason that the
> present pages actually need to be frozen in addition to being locked.
> 
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> a similar race where pages could temporarily disappear from mincore.
> 
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  mm/khugepaged.c | 79 ++++++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 7679551e9540..a19aa140fd52 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1855,17 +1855,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *
>   * Basic scheme is simple, details are more complex:
>   *  - allocate and lock a new huge page;
> - *  - scan page cache replacing old pages with the new one
> + *  - scan page cache, locking old pages
>   *    + swap/gup in pages if necessary;
> - *    + keep old pages around in case rollback is required;
> + *  - copy data to new page
> + *  - handle shmem holes
> + *    + re-validate that holes weren't filled by someone else
> + *    + check for userfaultfd

PS: some of the changes may belong to previous patch here, but not
necessary to repost only for this, just in case there'll be a new one.

>   *  - finalize updates to the page cache;
>   *  - if replacing succeeds:
> - *    + copy data over;
> - *    + free old pages;
>   *    + unlock huge page;
> + *    + free old pages;
>   *  - if replacing failed;
> - *    + put all pages back and unfreeze them;
> - *    + restore gaps in the page cache;
> + *    + unlock old pages
>   *    + unlock and free huge page;
>   */
>  static int collapse_file(struct mm_struct *mm, unsigned long addr,
> @@ -1913,12 +1914,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>  
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2076,12 +2071,16 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		VM_BUG_ON_PAGE(page != xas_load(&xas), page);
>  
>  		/*
> -		 * The page is expected to have page_count() == 3:
> +		 * We control three references to the page:
>  		 *  - we hold a pin on it;
>  		 *  - one reference from page cache;
>  		 *  - one from isolate_lru_page;
> +		 * If those are the only references, then any new usage of the
> +		 * page will have to fetch it from the page cache. That requires
> +		 * locking the page to handle truncate, so any new usage will be
> +		 * blocked until we unlock the page after collapse/during rollback.
>  		 */
> -		if (!page_ref_freeze(page, 3)) {
> +		if (page_count(page) != 3) {
>  			result = SCAN_PAGE_COUNT;
>  			xas_unlock_irq(&xas);
>  			putback_lru_page(page);

Personally I don't see anything wrong with this change to resolve the
deadlock.  E.g. fast gup race right before unmapping the pgtables seems fine,
since we'll just bail out with >3 refcounts (or fast-gup bails out by
checking pte changes).  Either way looks fine here.

So far it looks good to me, but that may not mean much per the history on
what I can overlook.  It'll always be good to hear from Hugh and others.

-- 
Peter Xu




* Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-04-04 21:21   ` Peter Xu
@ 2023-04-19  4:37     ` Hugh Dickins
  0 siblings, 0 replies; 10+ messages in thread
From: Hugh Dickins @ 2023-04-19  4:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Stevens, linux-mm, Hugh Dickins, Andrew Morton,
	Matthew Wilcox, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Jiaqi Yan, linux-kernel

On Tue, 4 Apr 2023, Peter Xu wrote:
> On Tue, Apr 04, 2023 at 09:01:17PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> > 
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range.
> > 
> > The present pages are already locked, so replacing them with the locked
> > hpage before the collapse is finalized is unnecessary. However, it is
> > necessary to stop freezing the present pages after validating them,
> > since leaving long-term frozen pages in the page cache can lead to
> > deadlocks. Simply checking the reference count is sufficient to ensure
> > that there are no long-term references hanging around that would the
> > collapse would break. Similar to hpage, there is no reason that the
> > present pages actually need to be frozen in addition to being locked.
> > 
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > a fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE. This also fixes
> > a similar race where pages could temporarily disappear from mincore.
> > 
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <stevensd@chromium.org>
...
> 
> Personally I don't see anything wrong with this change to resolve the dead
> lock.  E.g. fast gup race right before unmapping the pgtables seems fine,
> since we'll just bail out with >3 refcounts (or fast-gup bails out by
> checking pte changes).  Either way looks fine here.
> 
> So far it looks good to me, but that may not mean much per the history on
> what I can overlook.  It'll be always good to hear from Hugh and others.

I'm uneasy about it, and haven't let it sink in for long enough: but
haven't spotted anything wrong with it, nor experienced any trouble.

I would have much preferred David to stick with the current scheme, and
fix up seek_hole_data, and be less concerned with the mincore transients:
this patch makes a significant change that is difficult to be sure of.

I was dubious about the unfrozen "page_count(page) != 3" check (where
another task can grab a reference an instant later), but perhaps it
does serve a purpose, since we hold the page lock there: excludes
concurrent shmem reads which grab but drop page lock before copying
(though it's not clear that those do actually need excluding).

I had thought shmem was peculiar in relying on page lock while writing,
but turned out to be quite wrong about that: most filesystems rely on
page lock while writing, though I'm not sure whether that's true of
all (and it doesn't matter while collapse of non-shmem file is only
permitted on read-only).

We shall see.

Hugh



* Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-04-04 12:01 ` [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag David Stevens
  2023-04-04 21:21   ` Peter Xu
@ 2023-06-20 20:55   ` Andres Freund
  2023-06-20 21:11     ` Peter Xu
  1 sibling, 1 reply; 10+ messages in thread
From: Andres Freund @ 2023-06-20 20:55 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Peter Xu, Hugh Dickins, Andrew Morton, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Jiaqi Yan,
	linux-kernel

Hi,

On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> ...
> 
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>

I noticed that recently MADV_COLLAPSE stopped being able to collapse a
binary's executable code, always failing with EAGAIN. I bisected it down to
a2e17cc2efc7 - this commit.

Using perf trace -e 'huge_memory:*' -a I see

  1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
  1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
  1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)

for every attempt at doing madvise(MADV_COLLAPSE).


I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
using huge pages for executable code that wasn't completely gross.


I don't yet have a standalone repro, but can write one if that's helpful.
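
Roughly, such a repro would look like this (a sketch only, assuming
shmem backing via memfd and a single PMD-sized range; error checks and
2 MiB alignment of the mapping are omitted for brevity):

  #define _GNU_SOURCE		/* memfd_create */
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #ifndef MADV_COLLAPSE
  #define MADV_COLLAPSE 25	/* value from <linux/mman.h> */
  #endif

  int main(void)
  {
  	size_t len = 2UL << 20;		/* one PMD-sized range */
  	int fd = memfd_create("collapse-repro", 0);
  	char *p;

  	ftruncate(fd, len);
  	p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
  		 MAP_SHARED, fd, 0);
  	memset(p, 0xc3, len);		/* fault in all 512 small pages */
  	/* On the affected kernels this fails with EAGAIN every time. */
  	return madvise(p, len, MADV_COLLAPSE) == 0 ? 0 : 1;
  }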

Greetings,

Andres Freund



* Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-06-20 20:55   ` Andres Freund
@ 2023-06-20 21:11     ` Peter Xu
  2023-06-20 21:41       ` Andres Freund
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-06-20 21:11 UTC (permalink / raw)
  To: Andres Freund
  Cc: David Stevens, linux-mm, Hugh Dickins, Andrew Morton,
	Matthew Wilcox, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Jiaqi Yan, linux-kernel

On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote:
> Hi,

Hi, Andres,

> 
> On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> > ...
> 
> I noticed that recently MADV_COLLAPSE stopped being able to collapse a
> binary's executable code, always failing with EAGAIN. I bisected it down to
> a2e17cc2efc7 - this commit.
> 
> Using perf trace -e 'huge_memory:*' -a I see
> 
>   1000.433 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 1537, is_shmem: 1, filename: "postgres.2", result: 17)
>   1000.445 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
>   1000.485 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2049, is_shmem: 1, filename: "postgres.2", result: 17)
>   1000.489 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
>   1000.526 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 2561, is_shmem: 1, filename: "postgres.2", result: 17)
>   1000.532 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
>   1000.570 postgres.2/1872144 huge_memory:mm_khugepaged_collapse_file(mm: 0xffff889e800bdf00, hpfn: 46720000, index: 3073, is_shmem: 1, filename: "postgres.2", result: 17)
>   1000.575 postgres.2/1872144 huge_memory:mm_khugepaged_scan_file(mm: 0xffff889e800bdf00, pfn: -1, filename: "postgres.2", present: 512, result: 17)
> 
> for every attempt at doing madvise(MADV_COLLAPSE).
> 
> 
> I'm sad about that, because MADV_COLLAPSE was the first thing that allowed
> using huge pages for executable code that wasn't entirely completely gross.
> 
> 
> I don't yet have a standalone repro, but can write one if that's helpful.

There's a fix:

https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/

Already in today's Andrew's pull for rc7:

https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/

-- 
Peter Xu




* Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag
  2023-06-20 21:11     ` Peter Xu
@ 2023-06-20 21:41       ` Andres Freund
  0 siblings, 0 replies; 10+ messages in thread
From: Andres Freund @ 2023-06-20 21:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Stevens, linux-mm, Hugh Dickins, Andrew Morton,
	Matthew Wilcox, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Jiaqi Yan, linux-kernel

Hi,

On 2023-06-20 17:11:30 -0400, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 01:55:47PM -0700, Andres Freund wrote:
> > On 2023-04-04 21:01:17 +0900, David Stevens wrote:
> > > ...
> > 
> > I noticed that recently MADV_COLLAPSE stopped being able to collapse a
> > binary's executable code, always failing with EAGAIN. I bisected it down to
> > a2e17cc2efc7 - this commit.
> > 
> > ...
> 
> There's a fix:
> 
> https://lore.kernel.org/all/20230607053135.2087354-1-stevensd@google.com/
> 
> Already in today's Andrew's pull for rc7:
> 
> https://lore.kernel.org/all/20230620123828.813b1140d9c13af900e8edb3@linux-foundation.org/

Ah, great!

I can confirm that the fix unbreaks our use of MADV_COLLAPSE for executable
code...

Greetings,

Andres Freund


