* [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem
@ 2023-02-14  7:57 David Stevens
  2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
  2023-02-14 15:44 ` [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem Matthew Wilcox
  0 siblings, 2 replies; 16+ messages in thread
From: David Stevens @ 2023-02-14  7:57 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Matthew Wilcox
  Cc: Andrew Morton, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Hugh Dickins, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

In collapse_file, mark the THP as up-to-date before inserting it into
the page cache. This fixes a race where folio_seek_hole_data would
mistake the THP for a fallocated but unwritten page. This race is
visible to userspace via data temporarily disappearing from
SEEK_DATA/SEEK_HOLE, which can cause data loss for applications that use
lseek to efficiently snapshot sparse shmem files.
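
As an illustration, the lseek-based snapshot pattern at risk looks roughly
like the following userspace sketch (the file path is hypothetical and
error handling is abbreviated); if the THP is transiently visible while
still !uptodate, SEEK_DATA can skip an extent that actually holds data:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Hypothetical sparse file on a tmpfs mount. */
		int fd = open("/dev/shm/sparse-file", O_RDONLY);
		off_t end, data, hole;

		if (fd < 0)
			return 1;
		end = lseek(fd, 0, SEEK_END);

		/* Walk the data extents, skipping the holes between them. */
		for (data = lseek(fd, 0, SEEK_DATA); data >= 0 && data < end;
		     data = lseek(fd, hole, SEEK_DATA)) {
			hole = lseek(fd, data, SEEK_HOLE);
			/* A snapshot tool would copy [data, hole) here. */
			printf("data extent: [%lld, %lld)\n",
			       (long long)data, (long long)hole);
		}
		close(fd);
		return 0;
	}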

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: David Stevens <stevensd@chromium.org>
---
 mm/khugepaged.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 79be13133322..b648f1053d95 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1779,10 +1779,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	hpage->mapping = mapping;
 
 	/*
-	 * At this point the hpage is locked and not up-to-date.
-	 * It's safe to insert it into the page cache, because nobody would
-	 * be able to map it or use it in another way until we unlock it.
+	 * Mark hpage as up-to-date before inserting it into the page cache to
+	 * prevent it from being mistaken for a fallocated but unwritten page.
+	 * Inserting the unfinished hpage into the page cache is safe because
+	 * it is locked, so nobody can map it or use it in another way until we
+	 * unlock it.
 	 */
+	SetPageUptodate(hpage);
 
 	xas_set(&xas, start);
 	for (index = start; index < end; index++) {
-- 
2.39.1.581.gbfd45094c4-goog



* [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-14  7:57 [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem David Stevens
@ 2023-02-14  7:57 ` David Stevens
  2023-02-14 22:35   ` Peter Xu
  2023-02-15 22:48   ` Peter Xu
  2023-02-14 15:44 ` [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem Matthew Wilcox
  1 sibling, 2 replies; 16+ messages in thread
From: David Stevens @ 2023-02-14  7:57 UTC (permalink / raw)
  To: linux-mm, Peter Xu, Matthew Wilcox
  Cc: Andrew Morton, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Hugh Dickins, linux-kernel, David Stevens

From: David Stevens <stevensd@chromium.org>

Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then
for any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
collapsing a shmem range would result in replacing an empty page with a
THP, so that it doesn't break userfaultfd.

Synchronization when checking for userfaultfds in collapse_file is
tricky because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race.
This makes such a race indistinguishable from the collapse occurring
immediately before the userfaultfd registration.
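
For reference, arming such a userfaultfd from userspace looks roughly like
the following sketch (addr/len are hypothetical and must be page aligned;
error handling is collapsed into a single branch):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Arm a MISSING-mode userfaultfd over [addr, addr + len). */
	static int register_uffd_missing(void *addr, size_t len)
	{
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)addr, .len = len },
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};
		int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
		    ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;
		/*
		 * From here on, a fault on a missing page in the range is
		 * expected to deliver a UFFD_EVENT_PAGEFAULT message on
		 * uffd; that is the event a collapse silently filling the
		 * hole would suppress.
		 */
		return uffd;
	}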

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache
lock can be dropped during that loop. The second step is to fill the
gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
time, to avoid races with accesses to the page cache that only take the
RCU read lock.

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause
the syscall to fail with EAGAIN.
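
To make the new error surface concrete, a MADV_COLLAPSE caller would need
retry logic along these lines (a sketch only; the retry policy is up to
the caller, and the fallback define assumes the uapi value from
<asm-generic/mman-common.h>):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25
	#endif

	/* Try to collapse [addr, addr + len) into THPs; 0 on success. */
	static int try_collapse(void *addr, size_t len)
	{
		while (madvise(addr, len, MADV_COLLAPSE)) {
			if (errno == EAGAIN)
				continue;	/* e.g. a fault raced with us */
			/*
			 * EBUSY now also covers a range with missing pages
			 * under a MODE_MISSING userfaultfd.
			 */
			return -1;
		}
		return 0;
	}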

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_HOLE), but it
is exceedingly unlikely that anything relies on being able to observe
that transient state.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b648f1053d95..8c2e2349e883 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -55,6 +55,7 @@ enum scan_result {
 	SCAN_CGROUP_CHARGE_FAIL,
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
+	SCAN_PAGE_FILLED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
+ *  - finalize updates to the page cache;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
@@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index);
+					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		put_page(page);
 		goto xa_unlocked;
 	}
+
+	if (nr_none) {
+		struct vm_area_struct *vma;
+		int nr_none_check = 0;
+
+		xas_unlock_irq(&xas);
+		i_mmap_lock_read(mapping);
+		xas_lock_irq(&xas);
+
+		xas_set(&xas, start);
+		for (index = start; index < end; index++) {
+			if (!xas_next(&xas)) {
+				xas_store(&xas, XA_RETRY_ENTRY);
+				nr_none_check++;
+			}
+		}
+
+		if (nr_none != nr_none_check) {
+			result = SCAN_PAGE_FILLED;
+			goto immap_locked;
+		}
+
+		/*
+		 * If userspace observed a missing page in a VMA with an armed
+		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
+		 * that page, so we need to roll back to avoid suppressing such
+		 * an event. Any userfaultfds armed after this point will not be
+		 * able to observe any missing pages due to the previously
+		 * inserted retry entries.
+		 */
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
+			if (userfaultfd_missing(vma)) {
+				result = SCAN_EXCEED_NONE_PTE;
+				goto immap_locked;
+			}
+		}
+
+immap_locked:
+		i_mmap_unlock_read(mapping);
+		if (result != SCAN_SUCCEED) {
+			xas_set(&xas, start);
+			for (index = start; index < end; index++) {
+				if (xas_next(&xas) == XA_RETRY_ENTRY)
+					xas_store(&xas, NULL);
+			}
+
+			goto xa_locked;
+		}
+	}
+
 	nr = thp_nr_pages(hpage);
 
 	if (is_shmem)
@@ -2068,15 +2118,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
+		end = index;
+		for (index = start; index < end; index++) {
+			xas_next(&xas);
 			page = list_first_entry_or_null(&pagelist,
 					struct page, lru);
 			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
 				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
 				continue;
 			}
 
@@ -2592,11 +2640,13 @@ static int madvise_collapse_errno(enum scan_result r)
 	case SCAN_ALLOC_HUGE_PAGE_FAIL:
 		return -ENOMEM;
 	case SCAN_CGROUP_CHARGE_FAIL:
+	case SCAN_EXCEED_NONE_PTE:
 		return -EBUSY;
 	/* Resource temporary unavailable - trying again might succeed */
 	case SCAN_PAGE_LOCK:
 	case SCAN_PAGE_LRU:
 	case SCAN_DEL_PAGE_LRU:
+	case SCAN_PAGE_FILLED:
 		return -EAGAIN;
 	/*
 	 * Other: Trying again likely not to succeed / error intrinsic to
-- 
2.39.1.581.gbfd45094c4-goog



* Re: [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem
  2023-02-14  7:57 [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem David Stevens
  2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
@ 2023-02-14 15:44 ` Matthew Wilcox
  2023-02-15  1:33   ` David Stevens
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2023-02-14 15:44 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Peter Xu, Andrew Morton, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, linux-kernel

On Tue, Feb 14, 2023 at 04:57:09PM +0900, David Stevens wrote:
>  	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> +	 * Mark hpage as up-to-date before inserting it into the page cache to
> +	 * prevent it from being mistaken for a fallocated but unwritten page.
> +	 * Inserting the unfinished hpage into the page cache is safe because
> +	 * it is locked, so nobody can map it or use it in another way until we
> +	 * unlock it.

No, that's not true.  The data has to be there before we mark it
uptodate.  See filemap_get_pages() for example, used as part of
read().  We don't lock the page unless we need to bring it uptodate
ourselves.
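
That is, the buffered read fast path is shaped roughly like the following
sketch (illustrative only, not the literal filemap_get_pages() body;
fill_folio() is a stand-in for the mapping's real read_folio path):

	folio = filemap_get_folio(mapping, index);
	if (folio_test_uptodate(folio))
		return folio;	/* contents read with no folio lock held */

	folio_lock(folio);
	if (!folio_test_uptodate(folio))	/* re-check under the lock */
		fill_folio(folio);
	folio_unlock(folio);
	return folio;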


* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
@ 2023-02-14 22:35   ` Peter Xu
  2023-02-15  1:57     ` David Stevens
  2023-02-15 22:48   ` Peter Xu
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-02-14 22:35 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

Hi, David,

On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> [...]

Could you attach a changelog in your next post (probably with a cover
letter when there is more than one patch)?

Your patch 1 reminded me that, I think, both lseek and mincore will
report HOLE rather than DATA on the thp holes during collapse, no matter
whether we fill the hpage in (as long as the hpage is !uptodate) or not
(as you do with this one).
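
Roughly, a userspace residency probe like the following sketch (addr is a
hypothetical, page-aligned address inside the range) would see a hole
there during that window:

	#include <stdbool.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* True iff the page at the page-aligned addr is in the page cache. */
	static bool page_resident(void *addr)
	{
		unsigned char vec[1];

		if (mincore(addr, sysconf(_SC_PAGESIZE), vec))
			return false;
		return vec[0] & 1;	/* bit 0 of the status byte: resident */
	}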

However I don't understand how this new patch can avoid the same race issue
I mentioned in the last version at all.

-- 
Peter Xu



* Re: [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem
  2023-02-14 15:44 ` [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem Matthew Wilcox
@ 2023-02-15  1:33   ` David Stevens
  2023-02-15 22:05     ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: David Stevens @ 2023-02-15  1:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Peter Xu, Andrew Morton, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, linux-kernel

On Wed, Feb 15, 2023 at 12:44 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 14, 2023 at 04:57:09PM +0900, David Stevens wrote:
> >       /*
> > -      * At this point the hpage is locked and not up-to-date.
> > -      * It's safe to insert it into the page cache, because nobody would
> > -      * be able to map it or use it in another way until we unlock it.
> > +      * Mark hpage as up-to-date before inserting it into the page cache to
> > +      * prevent it from being mistaken for a fallocated but unwritten page.
> > +      * Inserting the unfinished hpage into the page cache is safe because
> > +      * it is locked, so nobody can map it or use it in another way until we
> > +      * unlock it.
>
> No, that's not true.  The data has to be there before we mark it
> uptodate.  See filemap_get_pages() for example, used as part of
> read().  We don't lock the page unless we need to bring it uptodate
> ourselves.

I've been focusing on the shmem case for collapse_file and forgot to
think about the !is_shmem case. As far as I could tell, shmem doesn't
use filemap_get_pages() and everything else in filemap.c/shmem.c that
checks folio_test_uptodate also locks the folio. But yeah, this would
break the !is_shmem case and is kind of sketchy anyway. I'll put
together a better patch.

-David


* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-14 22:35   ` Peter Xu
@ 2023-02-15  1:57     ` David Stevens
  2023-02-15 22:27       ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: David Stevens @ 2023-02-15  1:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Wed, Feb 15, 2023 at 7:35 AM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, David,
>
> On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > [...]
>
> Could you attach a changelog in your next post (probably with a cover
> letter when there is more than one patch)?
>
> Your patch 1 reminded me that, I think, both lseek and mincore will
> report HOLE rather than DATA on the thp holes during collapse, no matter
> whether we fill the hpage in (as long as the hpage is !uptodate) or not
> (as you do with this one).
>
> However I don't understand how this new patch can avoid the same race issue
> I mentioned in the last version at all.

If find_get_entry sees an XA_RETRY_ENTRY, then it will re-read from
the xarray. This means find_get_entry will loop while we're finalizing
the collapse - either until we finalize the collapse with the
multi-index hpage entry or abort the collapse and clear the retry
entry. So even if userspace registers a userfaultfd and calls lseek
after khugepaged checks for userfaultfds, the call to lseek will block
until the collapse is finished.

There are a number of other places in filemap.c/shmem.c that do their
own iteration over the xarray, and they all retry on xas_retry() as
well.
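
For reference, that retry pattern is shaped roughly like the following
sketch (the common form of an RCU xarray lookup loop, not any one
function's exact body):

	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (xas_retry(&xas, folio))
		goto repeat;	/* XA_RETRY_ENTRY: reload from the tree */
	if (folio && !xa_is_value(folio) && !folio_try_get(folio))
		goto repeat;	/* lost a refcount race; retry the walk */
	rcu_read_unlock();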

-David


* Re: [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem
  2023-02-15  1:33   ` David Stevens
@ 2023-02-15 22:05     ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-02-15 22:05 UTC (permalink / raw)
  To: David Stevens
  Cc: Matthew Wilcox, linux-mm, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Wed, Feb 15, 2023 at 10:33:15AM +0900, David Stevens wrote:
> On Wed, Feb 15, 2023 at 12:44 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Feb 14, 2023 at 04:57:09PM +0900, David Stevens wrote:
> > >       /*
> > > -      * At this point the hpage is locked and not up-to-date.
> > > -      * It's safe to insert it into the page cache, because nobody would
> > > -      * be able to map it or use it in another way until we unlock it.
> > > +      * Mark hpage as up-to-date before inserting it into the page cache to
> > > +      * prevent it from being mistaken for a fallocated but unwritten page.
> > > +      * Inserting the unfinished hpage into the page cache is safe because
> > > +      * it is locked, so nobody can map it or use it in another way until we
> > > +      * unlock it.
> >
> > No, that's not true.  The data has to be there before we mark it
> > uptodate.  See filemap_get_pages() for example, used as part of
> > read().  We don't lock the page unless we need to bring it uptodate
> > ourselves.
> 
> I've been focusing on the shmem case for collapse_file and forgot to
> think about the !is_shmem case. As far as I could tell, shmem doesn't
> use filemap_get_pages() and everything else in filemap.c/shmem.c that
> checks folio_test_uptodate also locks the folio. But yeah, this would
> break the !is_shmem case and is kind of sketchy anyway. I'll put
> together a better patch.

AFAIU we lock the page iff it's !uptodate and we want to wait for it to
become uptodate, or, as Matthew said, when we want to do the
!uptodate->uptodate transition ourselves.

Take the same example of folio_seek_hole_data() that you mentioned:

	if (xa_is_value(folio) || folio_test_uptodate(folio))
		return seek_data ? start : end;

-- 
Peter Xu



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-15  1:57     ` David Stevens
@ 2023-02-15 22:27       ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2023-02-15 22:27 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Wed, Feb 15, 2023 at 10:57:11AM +0900, David Stevens wrote:
> On Wed, Feb 15, 2023 at 7:35 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, David,
> >
> > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > [...]
> >
> > Could you attach a changelog in your next post (probably with a cover
> > letter when there is more than one patch)?
> >
> > Your patch 1 reminded me that, I think, both lseek and mincore will
> > report HOLE rather than DATA on the thp holes during collapse, no matter
> > whether we fill the hpage in (as long as the hpage is !uptodate) or not
> > (as you do with this one).
> >
> > However I don't understand how this new patch can avoid the same race issue
> > I mentioned in the last version at all.
> 
> If find_get_entry sees an XA_RETRY_ENTRY, then it will re-read from
> the xarray. This means find_get_entry will loop while we're finalizing
> the collapse - either until we finalize the collapse with the
> multi-index hpage entry or abort the collapse and clear the retry
> entry. So even if userspace registers a userfaultfd and calls lseek
> after khugepaged checks for userfaultfds, the call to lseek will block
> until the collapse is finished.
> 
> There are a number of other places in filemap.c/shmem.c that do their
> own iteration over the xarray, and they all retry on xas_retry() as
> well.

I have no problem with using RETRY entries (as long as others are fine with
it :).  It seems your logic depends on patch 1 being there already, so right
after the RETRY entry is replaced with the thp it'll show Uptodate==DATA.

However, I doubt whether patch 1 is correct at all.  Maybe that could
instead be fixed by moving:

		folio_mark_uptodate(folio);

to before:

		xas_set_order(&xas, start, HPAGE_PMD_ORDER);
		xas_store(&xas, hpage);

to replace patch 1, but I think there's still some issue in patch 2 even if
that works.  Ouch, I cut the code.  I'll comment inline in another reply.

-- 
Peter Xu



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
  2023-02-14 22:35   ` Peter Xu
@ 2023-02-15 22:48   ` Peter Xu
  2023-02-16  1:37     ` David Stevens
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-02-15 22:48 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> [...]
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b648f1053d95..8c2e2349e883 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_PAGE_FILLED,

PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.

>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *  - allocate and lock a new huge page;
>   *  - scan page cache replacing old pages with the new one
>   *    + swap/gup in pages if necessary;
> - *    + fill in gaps;

IIUC it's not a complete removal; it has just moved downwards:

>   *    + keep old pages around in case rollback is required;
> + *  - finalize updates to the page cache;

         + fill in gaps with RETRY entries
         + detect race conditions with userfaultfds

>   *  - if replacing succeeds:
>   *    + copy data over;
>   *    + free old pages;
> @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  						result = SCAN_TRUNCATED;
>  						goto xa_locked;
>  					}
> -					xas_set(&xas, index);
> +					xas_set(&xas, index + 1);
>  				}
>  				if (!shmem_charge(mapping->host, 1)) {
>  					result = SCAN_FAIL;
>  					goto xa_locked;
>  				}
> -				xas_store(&xas, hpage);
>  				nr_none++;
>  				continue;
>  			}
> @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		put_page(page);
>  		goto xa_unlocked;
>  	}
> +
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		xas_unlock_irq(&xas);
> +		i_mmap_lock_read(mapping);
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas)) {
> +				xas_store(&xas, XA_RETRY_ENTRY);
> +				nr_none_check++;
> +			}
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;
> +			goto immap_locked;
> +		}
> +
> +		/*
> +		 * If userspace observed a missing page in a VMA with an armed
> +		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> +		 * that page, so we need to roll back to avoid suppressing such
> +		 * an event. Any userfaultfds armed after this point will not be
> +		 * able to observe any missing pages due to the previously
> +		 * inserted retry entries.
> +		 */
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				goto immap_locked;
> +			}
> +		}
> +
> +immap_locked:
> +		i_mmap_unlock_read(mapping);
> +		if (result != SCAN_SUCCEED) {
> +			xas_set(&xas, start);
> +			for (index = start; index < end; index++) {
> +				if (xas_next(&xas) == XA_RETRY_ENTRY)
> +					xas_store(&xas, NULL);
> +			}
> +
> +			goto xa_locked;
> +		}
> +	}
> +

Up to here, everything looks fine to me (ignoring patch 1 for now; assuming
the hpage is always uptodate).

My question is that after here we'll release the page cache lock again
before try_to_unmap_flush(), but is it safe to keep RETRY entries after
releasing the page cache lock?  It means other threads can be spinning.  I
assume the page lock is always safe and sleepable, but I'm not sure about
the page cache lock here.

>  	nr = thp_nr_pages(hpage);
>  
>  	if (is_shmem)
> @@ -2068,15 +2118,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		xas_set(&xas, start);
> -		xas_for_each(&xas, page, end - 1) {
> +		end = index;
> +		for (index = start; index < end; index++) {
> +			xas_next(&xas);
>  			page = list_first_entry_or_null(&pagelist,
>  					struct page, lru);
>  			if (!page || xas.xa_index < page->index) {
> -				if (!nr_none)
> -					break;
>  				nr_none--;
> -				/* Put holes back where they were */
> -				xas_store(&xas, NULL);
>  				continue;
>  			}
>  
> @@ -2592,11 +2640,13 @@ static int madvise_collapse_errno(enum scan_result r)
>  	case SCAN_ALLOC_HUGE_PAGE_FAIL:
>  		return -ENOMEM;
>  	case SCAN_CGROUP_CHARGE_FAIL:
> +	case SCAN_EXCEED_NONE_PTE:
>  		return -EBUSY;
>  	/* Resource temporary unavailable - trying again might succeed */
>  	case SCAN_PAGE_LOCK:
>  	case SCAN_PAGE_LRU:
>  	case SCAN_DEL_PAGE_LRU:
> +	case SCAN_PAGE_FILLED:
>  		return -EAGAIN;
>  	/*
>  	 * Other: Trying again likely not to succeed / error intrinsic to
> -- 
> 2.39.1.581.gbfd45094c4-goog
> 

-- 
Peter Xu



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-15 22:48   ` Peter Xu
@ 2023-02-16  1:37     ` David Stevens
  2023-02-16 14:41       ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: David Stevens @ 2023-02-16  1:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > From: David Stevens <stevensd@chromium.org>
> >
> > [...]
>
> Up to here, everything looks fine to me (ignoring patch 1 for now;
> assuming the hpage is always uptodate).
>
> My question is that after here we'll release the page cache lock again
> before try_to_unmap_flush(), but is it safe to keep RETRY entries after
> releasing the page cache lock?  It means other threads can be spinning.
> I assume the page lock is always safe and sleepable, but I'm not sure
> about the page cache lock here.

We insert the multi-index entry for hpage before releasing the page
cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
page cache will be fully up to date when we release the lock, at least
in terms of which pages it contains.

-David


* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-16  1:37     ` David Stevens
@ 2023-02-16 14:41       ` Peter Xu
  2023-02-16 21:58         ` Yang Shi
  2023-02-17  2:00         ` David Stevens
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Xu @ 2023-02-16 14:41 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Thu, Feb 16, 2023 at 10:37:47AM +0900, David Stevens wrote:
> On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > [...]
> >
> > Up to here, everything looks fine to me (ignoring patch 1 for now;
> > assuming the hpage is always uptodate).
> >
> > My question is that after here we'll release the page cache lock again
> > before try_to_unmap_flush(), but is it safe to keep RETRY entries after
> > releasing the page cache lock?  It means other threads can be spinning.
> > I assume the page lock is always safe and sleepable, but I'm not sure
> > about the page cache lock here.
> 
> We insert the multi-index entry for hpage before releasing the page
> cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
> page cache will be fully up to date when we release the lock, at least
> in terms of which pages it contains.

IIUC we released it before copying the pages:

xa_locked:
	xas_unlock_irq(&xas);  <-------------------------------- here
xa_unlocked:

	/*
	 * If collapse is successful, flush must be done now before copying.
	 * If collapse is unsuccessful, does flush actually need to be done?
	 * Do it anyway, to clear the state.
	 */
	try_to_unmap_flush();

Before insertion of the multi-index:

        /* Join all the small entries into a single multi-index entry. */
        xas_set_order(&xas, start, HPAGE_PMD_ORDER);
        xas_store(&xas, hpage);

Thanks,

-- 
Peter Xu



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-16 14:41       ` Peter Xu
@ 2023-02-16 21:58         ` Yang Shi
  2023-02-16 23:07           ` Peter Xu
  2023-02-17  2:00         ` David Stevens
  1 sibling, 1 reply; 16+ messages in thread
From: Yang Shi @ 2023-02-16 21:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Stevens, linux-mm, Matthew Wilcox, Andrew Morton,
	Kirill A . Shutemov, David Hildenbrand, Hugh Dickins,
	linux-kernel

On Thu, Feb 16, 2023 at 6:41 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 16, 2023 at 10:37:47AM +0900, David Stevens wrote:
> > On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > > From: David Stevens <stevensd@chromium.org>
> > > >
> > > > Make sure that collapse_file respects any userfaultfds registered with
> > > > MODE_MISSING. If userspace has any such userfaultfds registered, then
> > > > for any page which it knows to be missing, it may expect a
> > > > UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
> > > > collapsing a shmem range would result in replacing an empty page with a
> > > > THP, so that it doesn't break userfaultfd.
> > > >
> > > > Synchronization when checking for userfaultfds in collapse_file is
> > > > tricky because the mmap locks can't be used to prevent races with the
> > > > registration of new userfaultfds. Instead, we provide synchronization by
> > > > ensuring that userspace cannot observe the fact that pages are missing
> > > > before we check for userfaultfds. Although this allows registration of a
> > > > userfaultfd to race with collapse_file, it ensures that userspace cannot
> > > > observe any pages transition from missing to present after such a race.
> > > > This makes such a race indistinguishable to the collapse occurring
> > > > immediately before the userfaultfd registration.
> > > >
> > > > The first step to provide this synchronization is to stop filling gaps
> > > > during the loop iterating over the target range, since the page cache
> > > > lock can be dropped during that loop. The second step is to fill the
> > > > gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> > > > time, to avoid races with accesses to the page cache that only take the
> > > > RCU read lock.
> > > >
> > > > This fix is targeted at khugepaged, but the change also applies to
> > > > MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> > > > return EBUSY if there are any missing pages (instead of succeeding on
> > > > shmem and returning EINVAL on anonymous memory). There is also now a
> > > > window during MADV_COLLAPSE where a fault on a missing page will cause
> > > > the syscall to fail with EAGAIN.
> > > >
> > > > The fact that intermediate page cache state can no longer be observed
> > > > before the rollback of a failed collapse is also technically a
> > > > userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> > > > is exceedingly unlikely that anything relies on being able to observe
> > > > that transient state.
> > > >
> > > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > > ---
> > > >  mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
> > > >  1 file changed, 58 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index b648f1053d95..8c2e2349e883 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -55,6 +55,7 @@ enum scan_result {
> > > >       SCAN_CGROUP_CHARGE_FAIL,
> > > >       SCAN_TRUNCATED,
> > > >       SCAN_PAGE_HAS_PRIVATE,
> > > > +     SCAN_PAGE_FILLED,
> > >
> > > PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.
> > >
> > > >  };
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > > @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > >   *  - allocate and lock a new huge page;
> > > >   *  - scan page cache replacing old pages with the new one
> > > >   *    + swap/gup in pages if necessary;
> > > > - *    + fill in gaps;
> > >
> > > IIUC it's not a complete removal, but just moved downwards:
> > >
> > > >   *    + keep old pages around in case rollback is required;
> > > > + *  - finalize updates to the page cache;
> > >
> > >          + fill in gaps with RETRY entries
> > >          + detect race conditions with userfaultfds
> > >
> > > >   *  - if replacing succeeds:
> > > >   *    + copy data over;
> > > >   *    + free old pages;
> > > > @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >                                               result = SCAN_TRUNCATED;
> > > >                                               goto xa_locked;
> > > >                                       }
> > > > -                                     xas_set(&xas, index);
> > > > +                                     xas_set(&xas, index + 1);
> > > >                               }
> > > >                               if (!shmem_charge(mapping->host, 1)) {
> > > >                                       result = SCAN_FAIL;
> > > >                                       goto xa_locked;
> > > >                               }
> > > > -                             xas_store(&xas, hpage);
> > > >                               nr_none++;
> > > >                               continue;
> > > >                       }
> > > > @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> > > >               put_page(page);
> > > >               goto xa_unlocked;
> > > >       }
> > > > +
> > > > +     if (nr_none) {
> > > > +             struct vm_area_struct *vma;
> > > > +             int nr_none_check = 0;
> > > > +
> > > > +             xas_unlock_irq(&xas);
> > > > +             i_mmap_lock_read(mapping);
> > > > +             xas_lock_irq(&xas);
> > > > +
> > > > +             xas_set(&xas, start);
> > > > +             for (index = start; index < end; index++) {
> > > > +                     if (!xas_next(&xas)) {
> > > > +                             xas_store(&xas, XA_RETRY_ENTRY);
> > > > +                             nr_none_check++;
> > > > +                     }
> > > > +             }
> > > > +
> > > > +             if (nr_none != nr_none_check) {
> > > > +                     result = SCAN_PAGE_FILLED;
> > > > +                     goto immap_locked;
> > > > +             }
> > > > +
> > > > +             /*
> > > > +              * If userspace observed a missing page in a VMA with an armed
> > > > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > > > +              * that page, so we need to roll back to avoid suppressing such
> > > > +              * an event. Any userfaultfds armed after this point will not be
> > > > +              * able to observe any missing pages due to the previously
> > > > +              * inserted retry entries.
> > > > +              */
> > > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > > +                     if (userfaultfd_missing(vma)) {
> > > > +                             result = SCAN_EXCEED_NONE_PTE;
> > > > +                             goto immap_locked;
> > > > +                     }
> > > > +             }
> > > > +
> > > > +immap_locked:
> > > > +             i_mmap_unlock_read(mapping);
> > > > +             if (result != SCAN_SUCCEED) {
> > > > +                     xas_set(&xas, start);
> > > > +                     for (index = start; index < end; index++) {
> > > > +                             if (xas_next(&xas) == XA_RETRY_ENTRY)
> > > > +                                     xas_store(&xas, NULL);
> > > > +                     }
> > > > +
> > > > +                     goto xa_locked;
> > > > +             }
> > > > +     }
> > > > +
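
For background, the userfaultfd arming that this rollback protects is done
from userspace roughly as in the sketch below. arm_uffd_missing() is
invented for the example, but the ioctls are the standard userfaultfd
handshake; see the userfaultfd documentation for the full flow.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/*
 * Sketch: arm a MODE_MISSING userfaultfd on an mmap()ed shmem range.
 * After this returns, userspace may expect a fault event for any page
 * it has observed to be missing -- exactly the expectation the
 * rollback above preserves.
 */
static int arm_uffd_missing(void *addr, unsigned long len)
{
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };

        if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
            ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
                return -1;
        return uffd;
}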
> > >
> > > Until here, all looks fine to me (ignoring patch 1 for now; assuming
> > > the hpage is always uptodate).
> > >
> > > My question is: after this point we release the page cache lock again
> > > before try_to_unmap_flush(), but is it safe to keep RETRY entries
> > > after releasing the page cache lock?  That means other threads could
> > > be left spinning on them.  I assume the page lock is always safe and
> > > sleepable, but I'm not sure about the page cache lock here.
> >
> > We insert the multi-index entry for hpage before releasing the page
> > cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
> > page cache will be fully up to date when we release the lock, at least
> > in terms of which pages it contains.
>
> IIUC we released it before copying the pages:

The huge page is locked until the copy is done. It should be fine
unless users inspect the page content without acquiring the page lock.
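
For reference, a kernel-side reader that does need stable contents would
take the page lock first, along the lines of this sketch (it uses
find_lock_page(); the exact helpers vary between kernel versions):

/*
 * Sketch: inspect page contents only under the page lock, so a
 * concurrent collapse cannot replace the page underneath us.
 */
struct page *page = find_lock_page(mapping, index);

if (page) {
        if (PageUptodate(page)) {
                /* contents are stable here */
        }
        unlock_page(page);
        put_page(page);
}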

>
> xa_locked:
>         xas_unlock_irq(&xas);  <-------------------------------- here
> xa_unlocked:
>
>         /*
>          * If collapse is successful, flush must be done now before copying.
>          * If collapse is unsuccessful, does flush actually need to be done?
>          * Do it anyway, to clear the state.
>          */
>         try_to_unmap_flush();
>
> Before insertion of the multi-index:
>
>         /* Join all the small entries into a single multi-index entry. */
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>         xas_store(&xas, hpage);


* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-16 21:58         ` Yang Shi
@ 2023-02-16 23:07           ` Peter Xu
  2023-02-16 23:52             ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2023-02-16 23:07 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Stevens, linux-mm, Matthew Wilcox, Andrew Morton,
	Kirill A . Shutemov, David Hildenbrand, Hugh Dickins,
	linux-kernel

Hi, Yang,

On Thu, Feb 16, 2023 at 01:58:55PM -0800, Yang Shi wrote:
> > IIUC we released it before copying the pages:
> 
> The huge page is locked until the copy is done. It should be fine
> unless users inspect the page content without acquiring the page lock.

The current patch from David has replaced "insert hpage into holes" with
"insert RETRY entries into holes", so IMHO the hpage is not visible at all
when the page cache lock is released here.

All the accessors (including RCU-protected page cache lookups, which may
not need to take the page lock) should spin on the RETRY entry, which
seems fine to me.  But my question was whether it's legal to keep them
spinning even after releasing the page cache lock.

Thanks,


-- 
Peter Xu



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-16 23:07           ` Peter Xu
@ 2023-02-16 23:52             ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2023-02-16 23:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Stevens, linux-mm, Matthew Wilcox, Andrew Morton,
	Kirill A . Shutemov, David Hildenbrand, Hugh Dickins,
	linux-kernel

On Thu, Feb 16, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Yang,
>
> On Thu, Feb 16, 2023 at 01:58:55PM -0800, Yang Shi wrote:
> > > IIUC we released it before copying the pages:
> >
> > The huge page is locked until the copy is done. It should be fine
> > unless users inspect the page content without acquiring the page lock.
>
> The current patch from David has replaced "insert hpage into holes" with
> "insert RETRY entries into holes", so IMHO the hpage is not visible at all
> when the page cache lock is released here.

IIRC his patch (just this patch, not including patch #1) conceptually does:

acquire xa_lock

fill the holes with retry entries
if (nr_none == nr_none_check && the uffd_missing check passes) {
    /* no hole was filled while holding xa_lock, and no uffd missing */
    install huge page in page cache    <-- huge page is visible here
} else {
    set error code
    replace retry entries back with NULL
}

release xa_lock

if (succeeded) {
    copy content to huge page
    unlock huge page
} else {
    restore the small pages
}


Am I missing something?

>
> All the accessors (including RCU-protected page cache lookups, which may
> not need to take the page lock) should spin on the RETRY entry, which
> seems fine to me.  But my question was whether it's legal to keep them
> spinning even after releasing the page cache lock.

After releasing the page cache lock, they should see either a NULL entry
or the huge page, IIUC.
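
For reference, the RCU-side lookup pattern in question looks roughly like
this sketch, modeled on the page cache lookup in mm/filemap.c (details
differ between kernel versions):

/*
 * Sketch: an RCU-protected page cache lookup.  xas_retry() sends the
 * reader back around the loop on XA_RETRY_ENTRY, so it spins until
 * the writer publishes either NULL or the real entry.
 */
XA_STATE(xas, &mapping->i_pages, index);
struct folio *folio;

rcu_read_lock();
repeat:
        xas_reset(&xas);
        folio = xas_load(&xas);
        if (xas_retry(&xas, folio))
                goto repeat;
        /* folio is now NULL, a value entry, or a stable pointer */
rcu_read_unlock();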



* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-16 14:41       ` Peter Xu
  2023-02-16 21:58         ` Yang Shi
@ 2023-02-17  2:00         ` David Stevens
  2023-02-17  3:20           ` Yang Shi
  1 sibling, 1 reply; 16+ messages in thread
From: David Stevens @ 2023-02-17  2:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, Matthew Wilcox, Andrew Morton, Kirill A . Shutemov,
	Yang Shi, David Hildenbrand, Hugh Dickins, linux-kernel

On Thu, Feb 16, 2023 at 11:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 16, 2023 at 10:37:47AM +0900, David Stevens wrote:
> > On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > > [...]
> > >
> > > Until here, all looks fine to me (ignoring patch 1 for now; assuming
> > > the hpage is always uptodate).
> > >
> > > My question is: after this point we release the page cache lock again
> > > before try_to_unmap_flush(), but is it safe to keep RETRY entries
> > > after releasing the page cache lock?  That means other threads could
> > > be left spinning on them.  I assume the page lock is always safe and
> > > sleepable, but I'm not sure about the page cache lock here.
> >
> > We insert the multi-index entry for hpage before releasing the page
> > cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
> > page cache will be fully up to date when we release the lock, at least
> > in terms of which pages it contains.
>
> IIUC we released it before copying the pages:
>
> xa_locked:
>         xas_unlock_irq(&xas);  <-------------------------------- here
> xa_unlocked:
>
>         /*
>          * If collapse is successful, flush must be done now before copying.
>          * If collapse is unsuccessful, does flush actually need to be done?
>          * Do it anyway, to clear the state.
>          */
>         try_to_unmap_flush();
>
> Before insertion of the multi-index:
>
>         /* Join all the small entries into a single multi-index entry. */
>         xas_set_order(&xas, start, HPAGE_PMD_ORDER);
>         xas_store(&xas, hpage);

Okay, I realize what's going on. There is a change in mm-everything
[1] that significantly rewrites collapse_file, and my patch is going
to conflict with that patch. I'll see if I can rework my patches on
top of that change.

[1] https://lore.kernel.org/all/20221205234059.42971-3-jiaqiyan@google.com/T/#u

-David


* Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
  2023-02-17  2:00         ` David Stevens
@ 2023-02-17  3:20           ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2023-02-17  3:20 UTC (permalink / raw)
  To: David Stevens
  Cc: Peter Xu, linux-mm, Matthew Wilcox, Andrew Morton,
	Kirill A . Shutemov, David Hildenbrand, Hugh Dickins,
	linux-kernel

On Thu, Feb 16, 2023 at 6:00 PM David Stevens <stevensd@chromium.org> wrote:
>
> On Thu, Feb 16, 2023 at 11:41 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:37:47AM +0900, David Stevens wrote:
> > > On Thu, Feb 16, 2023 at 7:48 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> > > > > [...]
> > > >
> > > > Until here, all looks fine to me (ignoring patch 1 for now; assuming
> > > > the hpage is always uptodate).
> > > >
> > > > My question is: after this point we release the page cache lock again
> > > > before try_to_unmap_flush(), but is it safe to keep RETRY entries
> > > > after releasing the page cache lock?  That means other threads could
> > > > be left spinning on them.  I assume the page lock is always safe and
> > > > sleepable, but I'm not sure about the page cache lock here.
> > >
> > > We insert the multi-index entry for hpage before releasing the page
> > > cache lock, which should replace all of the XA_RETRY_ENTRYs. So the
> > > page cache will be fully up to date when we release the lock, at least
> > > in terms of which pages it contains.
> >
> > IIUC we released it before copying the pages:
> >
> > [...]
>
> Okay, I realize what's going on. There is a change in mm-everything
> [1] that significantly rewrites collapse_file, and my patch is going
> to conflict with that patch. I'll see if I can rework my patches on
> top of that change.
>
> [1] https://lore.kernel.org/all/20221205234059.42971-3-jiaqiyan@google.com/T/#u

Aha, thanks for the heads up. I knew about this patch, but I didn't notice
it had landed in the mm-unstable tree...



end of thread

Thread overview: 16+ messages
2023-02-14  7:57 [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem David Stevens
2023-02-14  7:57 ` [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd David Stevens
2023-02-14 22:35   ` Peter Xu
2023-02-15  1:57     ` David Stevens
2023-02-15 22:27       ` Peter Xu
2023-02-15 22:48   ` Peter Xu
2023-02-16  1:37     ` David Stevens
2023-02-16 14:41       ` Peter Xu
2023-02-16 21:58         ` Yang Shi
2023-02-16 23:07           ` Peter Xu
2023-02-16 23:52             ` Yang Shi
2023-02-17  2:00         ` David Stevens
2023-02-17  3:20           ` Yang Shi
2023-02-14 15:44 ` [PATCH 1/2] mm/khugepaged: set THP as uptodate earlier for shmem Matthew Wilcox
2023-02-15  1:33   ` David Stevens
2023-02-15 22:05     ` Peter Xu
