Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd

From: Peter Xu <peterx@redhat.com>
To: David Stevens <stevensd@chromium.org>
Cc: linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Yang Shi <shy828301@gmail.com>,
	David Hildenbrand <david@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm/khugepaged: skip shmem with userfaultfd
Date: Wed, 15 Feb 2023 17:48:18 -0500	[thread overview]
Message-ID: <Y+1hMsm4eQIUtag5@x1n> (raw)
In-Reply-To: <20230214075710.2401855-2-stevensd@google.com>

On Tue, Feb 14, 2023 at 04:57:10PM +0900, David Stevens wrote:
> From: David Stevens <stevensd@chromium.org>
> 
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to take care when
> collapsing a shmem range would result in replacing an empty page with a
> THP, so that it doesn't break userfaultfd.
> 
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race.
> This makes such a race indistinguishable to the collapse occurring
> immediately before the userfaultfd registration.
> 
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
> 
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
> 
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> ---
>  mm/khugepaged.c | 66 +++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 58 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b648f1053d95..8c2e2349e883 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -55,6 +55,7 @@ enum scan_result {
>  	SCAN_CGROUP_CHARGE_FAIL,
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
> +	SCAN_PAGE_FILLED,

PS: You may want to also touch SCAN_STATUS in huge_memory.h next time.

>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -1725,8 +1726,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *  - allocate and lock a new huge page;
>   *  - scan page cache replacing old pages with the new one
>   *    + swap/gup in pages if necessary;
> - *    + fill in gaps;

IIUC it's not a complete removal, but just moved downwards:

>   *    + keep old pages around in case rollback is required;
> + *  - finalize updates to the page cache;

         + fill in gaps with RETRY entries
         + detect race conditions with userfaultfds

>   *  - if replacing succeeds:
>   *    + copy data over;
>   *    + free old pages;
> @@ -1805,13 +1806,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  						result = SCAN_TRUNCATED;
>  						goto xa_locked;
>  					}
> -					xas_set(&xas, index);
> +					xas_set(&xas, index + 1);
>  				}
>  				if (!shmem_charge(mapping->host, 1)) {
>  					result = SCAN_FAIL;
>  					goto xa_locked;
>  				}
> -				xas_store(&xas, hpage);
>  				nr_none++;
>  				continue;
>  			}
> @@ -1970,6 +1970,56 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		put_page(page);
>  		goto xa_unlocked;
>  	}
> +
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		xas_unlock_irq(&xas);
> +		i_mmap_lock_read(mapping);
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas)) {
> +				xas_store(&xas, XA_RETRY_ENTRY);
> +				nr_none_check++;
> +			}
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;
> +			goto immap_locked;
> +		}
> +
> +		/*
> +		 * If userspace observed a missing page in a VMA with an armed
> +		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> +		 * that page, so we need to roll back to avoid suppressing such
> +		 * an event. Any userfaultfds armed after this point will not be
> +		 * able to observe any missing pages due to the previously
> +		 * inserted retry entries.
> +		 */
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				goto immap_locked;
> +			}
> +		}
> +
> +immap_locked:
> +		i_mmap_unlock_read(mapping);
> +		if (result != SCAN_SUCCEED) {
> +			xas_set(&xas, start);
> +			for (index = start; index < end; index++) {
> +				if (xas_next(&xas) == XA_RETRY_ENTRY)
> +					xas_store(&xas, NULL);
> +			}
> +
> +			goto xa_locked;
> +		}
> +	}
> +

Until here, all look fine to me (ignoring patch 1 for now; assuming the
hpage is always uptodate).

My question is after here we'll release page cache lock again before
try_to_unmap_flush(), but is it safe to keep RETRY entries after releasing
page cache lock?  It means other threads can be spinning.  I assume page
lock is always safe and sleepable, but not sure about the page cache lock
here.

>  	nr = thp_nr_pages(hpage);
>  
>  	if (is_shmem)
> @@ -2068,15 +2118,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		xas_set(&xas, start);
> -		xas_for_each(&xas, page, end - 1) {
> +		end = index;
> +		for (index = start; index < end; index++) {
> +			xas_next(&xas);
>  			page = list_first_entry_or_null(&pagelist,
>  					struct page, lru);
>  			if (!page || xas.xa_index < page->index) {
> -				if (!nr_none)
> -					break;
>  				nr_none--;
> -				/* Put holes back where they were */
> -				xas_store(&xas, NULL);
>  				continue;
>  			}
>  
> @@ -2592,11 +2640,13 @@ static int madvise_collapse_errno(enum scan_result r)
>  	case SCAN_ALLOC_HUGE_PAGE_FAIL:
>  		return -ENOMEM;
>  	case SCAN_CGROUP_CHARGE_FAIL:
> +	case SCAN_EXCEED_NONE_PTE:
>  		return -EBUSY;
>  	/* Resource temporary unavailable - trying again might succeed */
>  	case SCAN_PAGE_LOCK:
>  	case SCAN_PAGE_LRU:
>  	case SCAN_DEL_PAGE_LRU:
> +	case SCAN_PAGE_FILLED:
>  		return -EAGAIN;
>  	/*
>  	 * Other: Trying again likely not to succeed / error intrinsic to
> -- 
> 2.39.1.581.gbfd45094c4-goog
> 

-- 
Peter Xu