linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/2] add callbacks for inaccessible pages
@ 2020-03-04 13:06 Claudio Imbrenda
  2020-03-04 13:06 ` [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages" Claudio Imbrenda
  2020-03-04 13:06 ` [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages Claudio Imbrenda
  0 siblings, 2 replies; 6+ messages in thread
From: Claudio Imbrenda @ 2020-03-04 13:06 UTC (permalink / raw)
  To: linux-next, akpm, jack, kirill
  Cc: borntraeger, david, aarcange, linux-mm, frankja, sfr, jhubbard,
	linux-kernel, linux-s390

This patchset has a fixup for gup/mm, and provides the necessary arch
hooks to enable protected virtualization.

Andrew: please simply squash/fixup the first patch into the appropriate
one that is already in your tree.

v2 -> v3:
* revert some cosmetic changes to improve readability
* improve some comments
v1 -> v2:
* use put_compound_head in the first patch
* fix commit message of the second patch
* minor code cleanups
* some comments to explain why sometimes we are not doing things

Claudio Imbrenda (2):
  mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages"
  mm/gup/writeback: add callbacks for inaccessible pages

 include/linux/gfp.h |  6 ++++
 mm/gup.c            | 76 +++++++++++++++++++++++++++++----------------
 mm/page-writeback.c |  5 +++
 3 files changed, 61 insertions(+), 26 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages"
  2020-03-04 13:06 [PATCH v3 0/2] add callbacks for inaccessible pages Claudio Imbrenda
@ 2020-03-04 13:06 ` Claudio Imbrenda
  2020-03-05 22:29   ` John Hubbard
  2020-03-04 13:06 ` [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages Claudio Imbrenda
  1 sibling, 1 reply; 6+ messages in thread
From: Claudio Imbrenda @ 2020-03-04 13:06 UTC (permalink / raw)
  To: linux-next, akpm, jack, kirill
  Cc: borntraeger, david, aarcange, linux-mm, frankja, sfr, jhubbard,
	linux-kernel, linux-s390

In case pin fails, we need to unpin, a simple put_page will not be enough

fixup for commit 9947ea2c1e608e32 ("mm/gup: track FOLL_PIN pages")

it can be simply squashed in

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 mm/gup.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f589299b0d4a..81a95fbe9901 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -116,6 +116,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	return NULL;
 }
 
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
+{
+	if (flags & FOLL_PIN) {
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+				    refs);
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
+
+	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  *
@@ -2134,7 +2156,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(head);
+			put_compound_head(head, 1, flags);
 			goto pte_unmap;
 		}
 
@@ -2267,28 +2289,6 @@ static int record_subpages(struct page *page, unsigned long addr,
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs, unsigned int flags)
-{
-	if (flags & FOLL_PIN) {
-		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
-				    refs);
-
-		if (hpage_pincount_available(page))
-			hpage_pincount_sub(page, refs);
-		else
-			refs *= GUP_PIN_COUNTING_BIAS;
-	}
-
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages
  2020-03-04 13:06 [PATCH v3 0/2] add callbacks for inaccessible pages Claudio Imbrenda
  2020-03-04 13:06 ` [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages" Claudio Imbrenda
@ 2020-03-04 13:06 ` Claudio Imbrenda
  2020-03-05 22:30   ` John Hubbard
  1 sibling, 1 reply; 6+ messages in thread
From: Claudio Imbrenda @ 2020-03-04 13:06 UTC (permalink / raw)
  To: linux-next, akpm, jack, kirill
  Cc: borntraeger, david, aarcange, linux-mm, frankja, sfr, jhubbard,
	linux-kernel, linux-s390, Will Deacon

With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.

While cpu accesses will trigger a fault that can be resolved, I/O
accesses will just fail.  We need to add a callback into architecture
code for places that will do I/O, namely when writeback is started or
when a page reference is taken.

This is not only to enable paging, file backing etc, it is also
necessary to protect the host against a malicious user space.  For
example a bad QEMU could simply start direct I/O on such protected
memory.  We do not want userspace to be able to trigger I/O errors and
thus the logic is "whenever somebody accesses that page (gup) or does
I/O, make sure that this page can be accessed".  When the guest tries
to access that page we will wait in the page fault handler for
writeback to have finished and for the page_ref to be the expected
value.

On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling
we can tackle this when we know the details.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 include/linux/gfp.h |  6 ++++++
 mm/gup.c            | 30 +++++++++++++++++++++++++++---
 mm/page-writeback.c |  5 +++++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
 #ifndef HAVE_ARCH_ALLOC_PAGE
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+	return 0;
+}
+#endif
 
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 81a95fbe9901..d0c4c6f336bb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
+	int ret;
 
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
@@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (is_zero_pfn(pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
-			int ret;
-
 			ret = follow_pfn_pte(vma, address, ptep, flags);
 			page = ERR_PTR(ret);
 			goto out;
@@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
-		int ret;
 		get_page(page);
 		pte_unmap_unlock(ptep, ptl);
 		lock_page(page);
@@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		page = ERR_PTR(-ENOMEM);
 		goto out;
 	}
+	/*
+	 * We need to make the page accessible if and only if we are going
+	 * to access its content (the FOLL_PIN case).  Please see
+	 * Documentation/core-api/pin_user_pages.rst for details.
+	 */
+	if (flags & FOLL_PIN) {
+		ret = arch_make_page_accessible(page);
+		if (ret) {
+			unpin_user_page(page);
+			page = ERR_PTR(ret);
+			goto out;
+		}
+	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 
+		/*
+		 * We need to make the page accessible if and only if we are
+		 * going to access its content (the FOLL_PIN case).  Please
+		 * see Documentation/core-api/pin_user_pages.rst for
+		 * details.
+		 */
+		if (flags & FOLL_PIN) {
+			ret = arch_make_page_accessible(page);
+			if (ret) {
+				unpin_user_page(page);
+				goto pte_unmap;
+			}
+		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..8384be5a2758 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
 	}
 	unlock_page_memcg(page);
+	/*
+	 * If writeback has been triggered on a page that cannot be made
+	 * accessible, it is too late.
+	 */
+	WARN_ON(arch_make_page_accessible(page));
 	return ret;
 
 }
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages"
  2020-03-04 13:06 ` [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages" Claudio Imbrenda
@ 2020-03-05 22:29   ` John Hubbard
  0 siblings, 0 replies; 6+ messages in thread
From: John Hubbard @ 2020-03-05 22:29 UTC (permalink / raw)
  To: Claudio Imbrenda, linux-next, akpm, jack, kirill
  Cc: borntraeger, david, aarcange, linux-mm, frankja, sfr,
	linux-kernel, linux-s390

On 3/4/20 5:06 AM, Claudio Imbrenda wrote:
> In case pin fails, we need to unpin, a simple put_page will not be enough
> 
> fixup for commit 9947ea2c1e608e32 ("mm/gup: track FOLL_PIN pages")
> 
> it can be simply squashed in
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---

Although it won't show up anywhere, in case this helps Andrew sort out what
has been looked at, you can add: 

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

>  mm/gup.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f589299b0d4a..81a95fbe9901 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -116,6 +116,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>  	return NULL;
>  }
>  
> +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> +{
> +	if (flags & FOLL_PIN) {
> +		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> +				    refs);
> +
> +		if (hpage_pincount_available(page))
> +			hpage_pincount_sub(page, refs);
> +		else
> +			refs *= GUP_PIN_COUNTING_BIAS;
> +	}
> +
> +	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> +	/*
> +	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> +	 * ref needs a put_page().
> +	 */
> +	if (refs > 1)
> +		page_ref_sub(page, refs - 1);
> +	put_page(page);
> +}
> +
>  /**
>   * try_grab_page() - elevate a page's refcount by a flag-dependent amount
>   *
> @@ -2134,7 +2156,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>  
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> -			put_page(head);
> +			put_compound_head(head, 1, flags);
>  			goto pte_unmap;
>  		}
>  
> @@ -2267,28 +2289,6 @@ static int record_subpages(struct page *page, unsigned long addr,
>  	return nr;
>  }
>  
> -static void put_compound_head(struct page *page, int refs, unsigned int flags)
> -{
> -	if (flags & FOLL_PIN) {
> -		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> -				    refs);
> -
> -		if (hpage_pincount_available(page))
> -			hpage_pincount_sub(page, refs);
> -		else
> -			refs *= GUP_PIN_COUNTING_BIAS;
> -	}
> -
> -	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> -	/*
> -	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> -	 * ref needs a put_page().
> -	 */
> -	if (refs > 1)
> -		page_ref_sub(page, refs - 1);
> -	put_page(page);
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>  				      unsigned long sz)
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages
  2020-03-04 13:06 ` [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages Claudio Imbrenda
@ 2020-03-05 22:30   ` John Hubbard
  2020-03-06 11:18     ` Claudio Imbrenda
  0 siblings, 1 reply; 6+ messages in thread
From: John Hubbard @ 2020-03-05 22:30 UTC (permalink / raw)
  To: Claudio Imbrenda, linux-next, akpm, jack, kirill
  Cc: borntraeger, david, aarcange, linux-mm, frankja, sfr,
	linux-kernel, linux-s390, Will Deacon

On 3/4/20 5:06 AM, Claudio Imbrenda wrote:
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
> 
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail.  We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.
> 
> This is not only to enable paging, file backing etc, it is also
> necessary to protect the host against a malicious user space.  For
> example a bad QEMU could simply start direct I/O on such protected
> memory.  We do not want userspace to be able to trigger I/O errors and
> thus the logic is "whenever somebody accesses that page (gup) or does
> I/O, make sure that this page can be accessed".  When the guest tries
> to access that page we will wait in the page fault handler for
> writeback to have finished and for the page_ref to be the expected
> value.
> 
> On s390x the function is not supposed to fail, so it is ok to use a
> WARN_ON on failure. If we ever need some more finegrained handling
> we can tackle this when we know the details.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Acked-by: Will Deacon <will@kernel.org>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  include/linux/gfp.h |  6 ++++++
>  mm/gup.c            | 30 +++++++++++++++++++++++++++---
>  mm/page-writeback.c |  5 +++++
>  3 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..be2754841369 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
>  #ifndef HAVE_ARCH_ALLOC_PAGE
>  static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> +static inline int arch_make_page_accessible(struct page *page)
> +{
> +	return 0;
> +}
> +#endif
>  
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> diff --git a/mm/gup.c b/mm/gup.c
> index 81a95fbe9901..d0c4c6f336bb 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	struct page *page;
>  	spinlock_t *ptl;
>  	pte_t *ptep, pte;
> +	int ret;
>  
>  	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>  	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> @@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		if (is_zero_pfn(pte_pfn(pte))) {
>  			page = pte_page(pte);
>  		} else {
> -			int ret;
> -
>  			ret = follow_pfn_pte(vma, address, ptep, flags);
>  			page = ERR_PTR(ret);
>  			goto out;
> @@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	}
>  
>  	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> -		int ret;
>  		get_page(page);
>  		pte_unmap_unlock(ptep, ptl);
>  		lock_page(page);
> @@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		page = ERR_PTR(-ENOMEM);
>  		goto out;
>  	}
> +	/*
> +	 * We need to make the page accessible if and only if we are going
> +	 * to access its content (the FOLL_PIN case).  Please see
> +	 * Documentation/core-api/pin_user_pages.rst for details.
> +	 */
> +	if (flags & FOLL_PIN) {
> +		ret = arch_make_page_accessible(page);
> +		if (ret) {
> +			unpin_user_page(page);
> +			page = ERR_PTR(ret);
> +			goto out;
> +		}
> +	}
>  	if (flags & FOLL_TOUCH) {
>  		if ((flags & FOLL_WRITE) &&
>  		    !pte_dirty(pte) && !PageDirty(page))
> @@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  
>  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>  
> +		/*
> +		 * We need to make the page accessible if and only if we are
> +		 * going to access its content (the FOLL_PIN case).  Please
> +		 * see Documentation/core-api/pin_user_pages.rst for
> +		 * details.
> +		 */
> +		if (flags & FOLL_PIN) {
> +			ret = arch_make_page_accessible(page);
> +			if (ret) {
> +				unpin_user_page(page);
> +				goto pte_unmap;
> +			}
> +		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
>  		(*nr)++;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab5a3cee8ad3..8384be5a2758 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
>  	}
>  	unlock_page_memcg(page);
> +	/*
> +	 * If writeback has been triggered on a page that cannot be made
> +	 * accessible, it is too late.
> +	 */
> +	WARN_ON(arch_make_page_accessible(page));

Hi,

Sorry for not commenting on this earlier. After looking at this a bit, I think a tiny 
tweak would be helpful, because:

a) WARN_ON() is a big problem for per-page issues, because, like ants, pages are prone to
   show up in large groups. And a warning and backtrace for each such page can easily
   bring a system to a crawl.

b) Based on your explanation of how this works, what your situation really seems to call
   for is the standard "crash hard in DEBUG builds, in order to keep developers out of
   trouble, but continue on in non-DEBUG builds".

So maybe you'd be better protected with this instead:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..b7f3d0766a5f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
 int __test_set_page_writeback(struct page *page, bool keep_write)
 {
        struct address_space *mapping = page_mapping(page);
-       int ret;
+       int ret, access_ret;
 
        lock_page_memcg(page);
        if (mapping && mapping_use_writeback_tags(mapping)) {
@@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
        }
        unlock_page_memcg(page);
+       access_ret = arch_make_page_accessible(page);
+       /*
+        * If writeback has been triggered on a page that cannot be made
+        * accessible, it is too late to recover here.
+        */
+       VM_BUG_ON_PAGE(access_ret != 0, page);
+
        return ret;
 
 }

Assuming that's acceptable, you can add:

      Reviewed-by: John Hubbard <jhubbard@nvidia.com>

to the updated patch.  

thanks,
-- 
John Hubbard
NVIDIA

>  	return ret;
>  
>  }
> 


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages
  2020-03-05 22:30   ` John Hubbard
@ 2020-03-06 11:18     ` Claudio Imbrenda
  0 siblings, 0 replies; 6+ messages in thread
From: Claudio Imbrenda @ 2020-03-06 11:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-next, akpm, jack, kirill, borntraeger, david, aarcange,
	linux-mm, frankja, sfr, linux-kernel, linux-s390, Will Deacon

On Thu, 5 Mar 2020 14:30:03 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> On 3/4/20 5:06 AM, Claudio Imbrenda wrote:
> > With the introduction of protected KVM guests on s390 there is now a
> > concept of inaccessible pages. These pages need to be made
> > accessible before the host can access them.
> > 
> > While cpu accesses will trigger a fault that can be resolved, I/O
> > accesses will just fail.  We need to add a callback into
> > architecture code for places that will do I/O, namely when
> > writeback is started or when a page reference is taken.
> > 
> > This is not only to enable paging, file backing etc, it is also
> > necessary to protect the host against a malicious user space.  For
> > example a bad QEMU could simply start direct I/O on such protected
> > memory.  We do not want userspace to be able to trigger I/O errors
> > and thus the logic is "whenever somebody accesses that page (gup)
> > or does I/O, make sure that this page can be accessed".  When the
> > guest tries to access that page we will wait in the page fault
> > handler for writeback to have finished and for the page_ref to be
> > the expected value.
> > 
> > On s390x the function is not supposed to fail, so it is ok to use a
> > WARN_ON on failure. If we ever need some more finegrained handling
> > we can tackle this when we know the details.
> > 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > Acked-by: Will Deacon <will@kernel.org>
> > Reviewed-by: David Hildenbrand <david@redhat.com>
> > Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > ---
> >  include/linux/gfp.h |  6 ++++++
> >  mm/gup.c            | 30 +++++++++++++++++++++++++++---
> >  mm/page-writeback.c |  5 +++++
> >  3 files changed, 38 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index e5b817cb86e7..be2754841369 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page
> > *page, int order) { } #ifndef HAVE_ARCH_ALLOC_PAGE
> >  static inline void arch_alloc_page(struct page *page, int order) {
> > } #endif
> > +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> > +static inline int arch_make_page_accessible(struct page *page)
> > +{
> > +	return 0;
> > +}
> > +#endif
> >  
> >  struct page *
> >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > preferred_nid, diff --git a/mm/gup.c b/mm/gup.c
> > index 81a95fbe9901..d0c4c6f336bb 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, struct page *page;
> >  	spinlock_t *ptl;
> >  	pte_t *ptep, pte;
> > +	int ret;
> >  
> >  	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> >  	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> > @@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, if (is_zero_pfn(pte_pfn(pte))) {
> >  			page = pte_page(pte);
> >  		} else {
> > -			int ret;
> > -
> >  			ret = follow_pfn_pte(vma, address, ptep,
> > flags); page = ERR_PTR(ret);
> >  			goto out;
> > @@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, }
> >  
> >  	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> > -		int ret;
> >  		get_page(page);
> >  		pte_unmap_unlock(ptep, ptl);
> >  		lock_page(page);
> > @@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, page = ERR_PTR(-ENOMEM);
> >  		goto out;
> >  	}
> > +	/*
> > +	 * We need to make the page accessible if and only if we
> > are going
> > +	 * to access its content (the FOLL_PIN case).  Please see
> > +	 * Documentation/core-api/pin_user_pages.rst for details.
> > +	 */
> > +	if (flags & FOLL_PIN) {
> > +		ret = arch_make_page_accessible(page);
> > +		if (ret) {
> > +			unpin_user_page(page);
> > +			page = ERR_PTR(ret);
> > +			goto out;
> > +		}
> > +	}
> >  	if (flags & FOLL_TOUCH) {
> >  		if ((flags & FOLL_WRITE) &&
> >  		    !pte_dirty(pte) && !PageDirty(page))
> > @@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned
> > long addr, unsigned long end, 
> >  		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> >  
> > +		/*
> > +		 * We need to make the page accessible if and only
> > if we are
> > +		 * going to access its content (the FOLL_PIN
> > case).  Please
> > +		 * see Documentation/core-api/pin_user_pages.rst
> > for
> > +		 * details.
> > +		 */
> > +		if (flags & FOLL_PIN) {
> > +			ret = arch_make_page_accessible(page);
> > +			if (ret) {
> > +				unpin_user_page(page);
> > +				goto pte_unmap;
> > +			}
> > +		}
> >  		SetPageReferenced(page);
> >  		pages[*nr] = page;
> >  		(*nr)++;
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index ab5a3cee8ad3..8384be5a2758 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page
> > *page, bool keep_write) inc_zone_page_state(page,
> > NR_ZONE_WRITE_PENDING); }
> >  	unlock_page_memcg(page);
> > +	/*
> > +	 * If writeback has been triggered on a page that cannot
> > be made
> > +	 * accessible, it is too late.
> > +	 */
> > +	WARN_ON(arch_make_page_accessible(page));  
> 
> Hi,
> 
> Sorry for not commenting on this earlier. After looking at this a
> bit, I think a tiny tweak would be helpful, because:
> 
> a) WARN_ON() is a big problem for per-page issues, because, like
> ants, pages are prone to show up in large groups. And a warning and
> backtrace for each such page can easily bring a system to a crawl.
> 
> b) Based on your explanation of how this works, what your situation
> really seems to call for is the standard "crash hard in DEBUG builds,
> in order to keep developers out of trouble, but continue on in
> non-DEBUG builds".
> 
> So maybe you'd be better protected with this instead:
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab5a3cee8ad3..b7f3d0766a5f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
>  int __test_set_page_writeback(struct page *page, bool keep_write)
>  {
>         struct address_space *mapping = page_mapping(page);
> -       int ret;
> +       int ret, access_ret;
>  
>         lock_page_memcg(page);
>         if (mapping && mapping_use_writeback_tags(mapping)) {
> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> *page, bool keep_write) inc_zone_page_state(page,
> NR_ZONE_WRITE_PENDING); }
>         unlock_page_memcg(page);
> +       access_ret = arch_make_page_accessible(page);
> +       /*
> +        * If writeback has been triggered on a page that cannot be
> made
> +        * accessible, it is too late to recover here.
> +        */
> +       VM_BUG_ON_PAGE(access_ret != 0, page);
> +
>         return ret;
>  
>  }
> 
> Assuming that's acceptable, you can add:
> 
>       Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> to the updated patch.  

I will send an updated patch, thanks a lot for the feedback!



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-03-06 11:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-04 13:06 [PATCH v3 0/2] add callbacks for inaccessible pages Claudio Imbrenda
2020-03-04 13:06 ` [PATCH v3 1/2] mm/gup: fixup for 9947ea2c1e608e32 "mm/gup: track FOLL_PIN pages" Claudio Imbrenda
2020-03-05 22:29   ` John Hubbard
2020-03-04 13:06 ` [PATCH v3 2/2] mm/gup/writeback: add callbacks for inaccessible pages Claudio Imbrenda
2020-03-05 22:30   ` John Hubbard
2020-03-06 11:18     ` Claudio Imbrenda

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).