All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning
@ 2023-05-26 21:41 David Howells
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: David Howells @ 2023-05-26 21:41 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm

Hi Christoph, David, Lorenzo,

Here are three patches that go on top of the similar patches for bio
structs now in the block tree that make the old block direct-IO code use
iov_iter_extract_pages() and page pinning.

There are three patches:

 (1) Make page pinning neither add nor remove a pin to/from a ZERO_PAGE,
     thereby allowing the dio code to insert zero pages in the middle of
     dealing with pinned pages.  This also mitigates a potential problem
     whereby userspace could force the overrun the pin counter of a zero
     page.

     A pair of functions are provided to wrap the testing of a page or
     folio to see if it is a zero page.

 (2) Provide a function to allow an additional pin to be taken on a page we
     already have pinned (and do nothing for a zero page).

 (3) Switch direct-io.c over to using page pinning and to use
     iov_iter_extract_pages() so that pages from non-user-backed iterators
     aren't pinned.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-old-dio

David

Changes
=======
ver #4)
 - Use _inc rather than _add ops when we're just adding 1.

ver #3)
 - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
 - Add more comments and adjust the docs about pinning zero pages.
 - Rename page_get_additional_pin() to folio_add_pin().
 - Use is_zero_folio() in folio_add_pin().
 - Rename need_unpin to is_pinned in struct dio.

ver #2)
 - Fix use of ZERO_PAGE().
 - Add wrappers for testing if a page is a zero page.
 - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
 - Need to set BIO_PAGE_PINNED conditionally, and not BIO_PAGE_REFFED.

Link: https://lore.kernel.org/r/ZGxfrOLZ4aN9/MvE@infradead.org/ [1]
Link: https://lore.kernel.org/r/20230525155102.87353-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230525223953.225496-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230526112859.654506-1-dhowells@redhat.com/ # v3

David Howells (3):
  mm: Don't pin ZERO_PAGE in pin_user_pages()
  mm: Provide a function to get an additional pin on a page
  block: Use iov_iter_extract_pages() and page pinning in direct-io.c

 Documentation/core-api/pin_user_pages.rst |  6 ++
 fs/direct-io.c                            | 72 ++++++++++++++---------
 include/linux/mm.h                        | 27 ++++++++-
 mm/gup.c                                  | 58 +++++++++++++++++-
 4 files changed, 131 insertions(+), 32 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26 21:41 [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
@ 2023-05-26 21:41 ` David Howells
  2023-05-27 19:46   ` Lorenzo Stoakes
                     ` (3 more replies)
  2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 17+ messages in thread
From: David Howells @ 2023-05-26 21:41 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #3)
     - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
     - Add more comments and adjust the docs.
    
    ver #2)
     - Fix use of ZERO_PAGE().
     - Add is_zero_page() and is_zero_folio() wrappers.
     - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.

 Documentation/core-api/pin_user_pages.rst |  6 +++++
 include/linux/mm.h                        | 26 +++++++++++++++++--
 mm/gup.c                                  | 31 ++++++++++++++++++++++-
 3 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9fb0b1080d3b..d3c1f6d8c0e0 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -112,6 +112,12 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
+* Because of that limitation, special handling is applied to the zero pages
+  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
+  refcount or pincount at all (it is permanent, so there's no need).  The
+  unpinning functions also don't do anything to a zero page.  This is
+  transparent to the caller.
+
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2f6b452586 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1910,6 +1910,28 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
 	return page_maybe_dma_pinned(page);
 }
 
+/**
+ * is_zero_page - Query if a page is a zero page
+ * @page: The page to query
+ *
+ * This returns true if @page is one of the permanent zero pages.
+ */
+static inline bool is_zero_page(const struct page *page)
+{
+	return is_zero_pfn(page_to_pfn(page));
+}
+
+/**
+ * is_zero_folio - Query if a folio is a zero page
+ * @folio: The folio to query
+ *
+ * This returns true if @folio is one of the permanent zero pages.
+ */
+static inline bool is_zero_folio(const struct folio *folio)
+{
+	return is_zero_page(&folio->page);
+}
+
 /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
 #ifdef CONFIG_MIGRATION
 static inline bool is_longterm_pinnable_page(struct page *page)
@@ -1920,8 +1942,8 @@ static inline bool is_longterm_pinnable_page(struct page *page)
 	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
 		return false;
 #endif
-	/* The zero page may always be pinned */
-	if (is_zero_pfn(page_to_pfn(page)))
+	/* The zero page can be "pinned" but gets special handling. */
+	if (is_zero_page(page))
 		return true;
 
 	/* Coherent device memory must always allow eviction. */
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..ad28261dcafd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 		struct page *page = *pages;
 		struct folio *folio = page_folio(page);
 
-		if (!folio_test_anon(folio))
+		if (is_zero_page(page) ||
+		    !folio_test_anon(folio))
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	else if (flags & FOLL_PIN) {
 		struct folio *folio;
 
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return page_folio(page);
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		if (is_zero_folio(folio))
+			return;
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
 		if (folio_test_large(folio))
 			atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return 0;
+
 		/*
 		 * Similar to try_grab_folio(): be sure to *also*
 		 * increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
@@ -3110,6 +3130,9 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_remote(struct mm_struct *mm,
 			   unsigned long start, unsigned long nr_pages,
@@ -3143,6 +3166,9 @@ EXPORT_SYMBOL(pin_user_pages_remote);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for details.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
@@ -3161,6 +3187,9 @@ EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if a zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page*() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     struct page **pages, unsigned int gup_flags)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-26 21:41 [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
@ 2023-05-26 21:41 ` David Howells
  2023-05-31  3:57   ` Christoph Hellwig
                     ` (2 more replies)
  2023-05-26 21:41 ` [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
  2023-05-31 15:48 ` [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning Jens Axboe
  3 siblings, 3 replies; 17+ messages in thread
From: David Howells @ 2023-05-26 21:41 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Provide a function to get an additional pin on a page that we already have
a pin on.  This will be used in fs/direct-io.c when dispatching multiple
bios to a page we've extracted from a user-backed iter rather than redoing
the extraction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #4)
     - Use _inc rather than _add ops when we're just adding 1.
    
    ver #3)
     - Rename to folio_add_pin().
     - Change to using is_zero_folio()

 include/linux/mm.h |  1 +
 mm/gup.c           | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c2f6b452586..200068d98686 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2405,6 +2405,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+void folio_add_pin(struct folio *folio);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
diff --git a/mm/gup.c b/mm/gup.c
index ad28261dcafd..0814576b7366 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -275,6 +275,33 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+/**
+ * folio_add_pin - Try to get an additional pin on a pinned folio
+ * @folio: The folio to be pinned
+ *
+ * Get an additional pin on a folio we already have a pin on.  Makes no change
+ * if the folio is a zero_page.
+ */
+void folio_add_pin(struct folio *folio)
+{
+	if (is_zero_folio(folio))
+		return;
+
+	/*
+	 * Similar to try_grab_folio(): be sure to *also* increment the normal
+	 * page refcount field at least once, so that the page really is
+	 * pinned.
+	 */
+	if (folio_test_large(folio)) {
+		WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
+		folio_ref_inc(folio);
+		atomic_inc(&folio->_pincount);
+	} else {
+		WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+}
+
 static inline struct folio *gup_folio_range_next(struct page *start,
 		unsigned long npages, unsigned long i, unsigned int *ntails)
 {


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
  2023-05-26 21:41 [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
@ 2023-05-26 21:41 ` David Howells
  2023-05-31  3:58   ` Christoph Hellwig
  2023-05-31 15:48 ` [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning Jens Axboe
  3 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2023-05-26 21:41 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Change the old block-based direct-I/O code to use iov_iter_extract_pages()
to pin user pages or leave kernel pages unpinned rather than taking refs
when submitting bios.

This makes use of the preceding patches to not take pins on the zero page
(thereby allowing insertion of zero pages in with pinned pages) and to get
additional pins on pages, allowing an extracted page to be used in multiple
bios without having to re-extract it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #3)
     - Rename need_unpin to is_pinned in struct dio.
     - page_get_additional_pin() was renamed to folio_add_pin().
    
    ver #2)
     - Need to set BIO_PAGE_PINNED conditionally, not BIO_PAGE_REFFED.

 fs/direct-io.c | 72 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ad20f3428bab..0643f1bb4b59 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -42,8 +42,8 @@
 #include "internal.h"
 
 /*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
+ * How many user pages to map in one call to iov_iter_extract_pages().  This
+ * determines the size of a structure in the slab cache
  */
 #define DIO_PAGES	64
 
@@ -121,12 +121,13 @@ struct dio {
 	struct inode *inode;
 	loff_t i_size;			/* i_size when submitted */
 	dio_iodone_t *end_io;		/* IO completion function */
+	bool is_pinned;			/* T if we have pins on the pages */
 
 	void *private;			/* copy from map_bh.b_private */
 
 	/* BIO completion state */
 	spinlock_t bio_lock;		/* protects BIO fields below */
-	int page_errors;		/* errno from get_user_pages() */
+	int page_errors;		/* err from iov_iter_extract_pages() */
 	int is_async;			/* is IO async ? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
 	bool should_dirty;		/* if pages should be dirtied */
@@ -165,14 +166,14 @@ static inline unsigned dio_pages_present(struct dio_submit *sdio)
  */
 static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
+	struct page **pages = dio->pages;
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX,
+				     DIO_PAGES, 0, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
-		struct page *page = ZERO_PAGE(0);
 		/*
 		 * A memory fault, but the filesystem has some outstanding
 		 * mapped blocks.  We need to use those blocks up to avoid
@@ -180,8 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
-		dio->pages[0] = page;
+		dio->pages[0] = ZERO_PAGE(0);
 		sdio->head = 0;
 		sdio->tail = 1;
 		sdio->from = 0;
@@ -201,9 +201,9 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
- * buffered inside the dio so that we can call get_user_pages() against a
- * decent number of pages, less frequently.  To provide nicer use of the
- * L1 cache.
+ * buffered inside the dio so that we can call iov_iter_extract_pages()
+ * against a decent number of pages, less frequently.  To provide nicer use of
+ * the L1 cache.
  */
 static inline struct page *dio_get_page(struct dio *dio,
 					struct dio_submit *sdio)
@@ -219,6 +219,18 @@ static inline struct page *dio_get_page(struct dio *dio,
 	return dio->pages[sdio->head];
 }
 
+static void dio_pin_page(struct dio *dio, struct page *page)
+{
+	if (dio->is_pinned)
+		folio_add_pin(page_folio(page));
+}
+
+static void dio_unpin_page(struct dio *dio, struct page *page)
+{
+	if (dio->is_pinned)
+		unpin_user_page(page);
+}
+
 /*
  * dio_complete() - called when all DIO BIO I/O has been completed
  *
@@ -402,8 +414,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_aio;
 	else
 		bio->bi_end_io = dio_bio_end_io;
-	/* for now require references for all pages */
-	bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (dio->is_pinned)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -444,8 +456,9 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
  */
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
-	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+	if (dio->is_pinned)
+		unpin_user_pages(dio->pages + sdio->head,
+				 sdio->tail - sdio->head);
 }
 
 /*
@@ -676,7 +689,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
  *
  * Return zero on success.  Non-zero means the caller needs to start a new BIO.
  */
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
 {
 	int ret;
 
@@ -688,7 +701,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		dio_pin_page(dio, sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -743,11 +756,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 			goto out;
 	}
 
-	if (dio_bio_add_page(sdio) != 0) {
+	if (dio_bio_add_page(dio, sdio) != 0) {
 		dio_bio_submit(dio, sdio);
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
-			ret = dio_bio_add_page(sdio);
+			ret = dio_bio_add_page(dio, sdio);
 			BUG_ON(ret != 0);
 		}
 	}
@@ -804,13 +817,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_unpin_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	dio_pin_page(dio, page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -825,7 +838,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_unpin_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -926,7 +939,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_unpin_page(dio, page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -971,7 +984,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_unpin_page(dio, page);
 					return -ENOTBLK;
 				}
 
@@ -984,7 +997,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_unpin_page(dio, page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1024,7 +1037,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_unpin_page(dio, page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1039,8 +1052,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				break;
 		}
 
-		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		/* Drop the pin which was taken in get_user_pages() */
+		dio_unpin_page(dio, page);
 	}
 out:
 	return ret;
@@ -1135,6 +1148,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		/* will be released by direct_io_worker */
 		inode_lock(inode);
 	}
+	dio->is_pinned = iov_iter_extract_will_pin(iter);
 
 	/* Once we sampled i_size check for reads beyond EOF */
 	dio->i_size = i_size_read(inode);
@@ -1259,7 +1273,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_unpin_page(dio, sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
@ 2023-05-27 19:46   ` Lorenzo Stoakes
  2023-05-31  3:57   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2023-05-27 19:46 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Fri, May 26, 2023 at 10:41:40PM +0100, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.
>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: David Hildenbrand <david@redhat.com>
> cc: Lorenzo Stoakes <lstoakes@gmail.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Jan Kara <jack@suse.cz>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Jason Gunthorpe <jgg@nvidia.com>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: Hillf Danton <hdanton@sina.com>
> cc: Christian Brauner <brauner@kernel.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-kernel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>
> Notes:
>     ver #3)
>      - Move is_zero_page() and is_zero_folio() to mm.h for dependency reasons.
>      - Add more comments and adjust the docs.
>
>     ver #2)
>      - Fix use of ZERO_PAGE().
>      - Add is_zero_page() and is_zero_folio() wrappers.
>      - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
>
>  Documentation/core-api/pin_user_pages.rst |  6 +++++
>  include/linux/mm.h                        | 26 +++++++++++++++++--
>  mm/gup.c                                  | 31 ++++++++++++++++++++++-
>  3 files changed, 60 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9fb0b1080d3b..d3c1f6d8c0e0 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -112,6 +112,12 @@ pages:
>  This also leads to limitations: there are only 31-10==21 bits available for a
>  counter that increments 10 bits at a time.
>
> +* Because of that limitation, special handling is applied to the zero pages
> +  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
> +  refcount or pincount at all (it is permanent, so there's no need).  The
> +  unpinning functions also don't do anything to a zero page.  This is
> +  transparent to the caller.
> +

Great! Cheers. :)

>  * Callers must specifically request "dma-pinned tracking of pages". In other
>    words, just calling get_user_pages() will not suffice; a new set of functions,
>    pin_user_page() and related, must be used.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..3c2f6b452586 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1910,6 +1910,28 @@ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
>  	return page_maybe_dma_pinned(page);
>  }
>
> +/**
> + * is_zero_page - Query if a page is a zero page
> + * @page: The page to query
> + *
> + * This returns true if @page is one of the permanent zero pages.
> + */
> +static inline bool is_zero_page(const struct page *page)
> +{
> +	return is_zero_pfn(page_to_pfn(page));
> +}
> +
> +/**
> + * is_zero_folio - Query if a folio is a zero page
> + * @folio: The folio to query
> + *
> + * This returns true if @folio is one of the permanent zero pages.
> + */
> +static inline bool is_zero_folio(const struct folio *folio)
> +{
> +	return is_zero_page(&folio->page);
> +}
> +
>  /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */
>  #ifdef CONFIG_MIGRATION
>  static inline bool is_longterm_pinnable_page(struct page *page)
> @@ -1920,8 +1942,8 @@ static inline bool is_longterm_pinnable_page(struct page *page)
>  	if (mt == MIGRATE_CMA || mt == MIGRATE_ISOLATE)
>  		return false;
>  #endif
> -	/* The zero page may always be pinned */
> -	if (is_zero_pfn(page_to_pfn(page)))
> +	/* The zero page can be "pinned" but gets special handling. */
> +	if (is_zero_page(page))
>  		return true;
>
>  	/* Coherent device memory must always allow eviction. */
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>  		struct page *page = *pages;
>  		struct folio *folio = page_folio(page);
>
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))
>  			continue;
>  		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>  			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	else if (flags & FOLL_PIN) {
>  		struct folio *folio;
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>  		/*
>  		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>  		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>  		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>  		if (folio_test_large(folio))
>  			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>  	if (flags & FOLL_GET)
>  		folio_ref_inc(folio);
>  	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>  		/*
>  		 * Similar to try_grab_folio(): be sure to *also*
>  		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */
>  int pin_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages)
> @@ -3110,6 +3130,9 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_remote(struct mm_struct *mm,
>  			   unsigned long start, unsigned long nr_pages,
> @@ -3143,6 +3166,9 @@ EXPORT_SYMBOL(pin_user_pages_remote);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
> @@ -3161,6 +3187,9 @@ EXPORT_SYMBOL(pin_user_pages);
>   * pin_user_pages_unlocked() is the FOLL_PIN variant of
>   * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>   * FOLL_PIN and rejects FOLL_GET.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page*() will not remove pins from it.
>   */
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  			     struct page **pages, unsigned int gup_flags)
>

All looks good, thanks for adding comments/updating doc!

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-27 19:46   ` Lorenzo Stoakes
@ 2023-05-31  3:57   ` Christoph Hellwig
  2023-05-31  7:34   ` David Hildenbrand
  2023-05-31  8:35   ` David Howells
  3 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-05-31  3:57 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes,
	Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
@ 2023-05-31  3:57   ` Christoph Hellwig
  2023-05-31  7:42   ` David Hildenbrand
  2023-05-31  8:20   ` David Howells
  2 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-05-31  3:57 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes,
	Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
  2023-05-26 21:41 ` [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
@ 2023-05-31  3:58   ` Christoph Hellwig
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2023-05-31  3:58 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes,
	Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-27 19:46   ` Lorenzo Stoakes
  2023-05-31  3:57   ` Christoph Hellwig
@ 2023-05-31  7:34   ` David Hildenbrand
  2023-05-31  8:35   ` David Howells
  3 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-05-31  7:34 UTC (permalink / raw)
  To: David Howells, Christoph Hellwig, Lorenzo Stoakes
  Cc: Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

On 26.05.23 23:41, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.

2 millions pins (FOLL_PIN, which increments the refcount by 1024) or 2 
million references ?

> 
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
> 

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..ad28261dcafd 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>   		struct page *page = *pages;
>   		struct folio *folio = page_folio(page);
>   
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))

Nit: Fits into a single line (without harming readability IMHO).

>   			continue;
>   		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>   			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   	else if (flags & FOLL_PIN) {
>   		struct folio *folio;
>   
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +
>   		/*
>   		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>   		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>   {
>   	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>   		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>   		if (folio_test_large(folio))
>   			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>   	if (flags & FOLL_GET)
>   		folio_ref_inc(folio);
>   	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>   		/*
>   		 * Similar to try_grab_folio(): be sure to *also*
>   		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>    *
>    * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>    * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if a zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>    */

"it will not have pins in it" sounds fairly weird to a non-native speaker.

"Note that the refcount of any zero_pages returned among the pinned 
pages will not be incremented, and unpin_user_page() will similarly not 
decrement it."


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
  2023-05-31  3:57   ` Christoph Hellwig
@ 2023-05-31  7:42   ` David Hildenbrand
  2023-05-31  8:20   ` David Howells
  2 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-05-31  7:42 UTC (permalink / raw)
  To: David Howells, Christoph Hellwig, Lorenzo Stoakes
  Cc: Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton, John Hubbard

On 26.05.23 23:41, David Howells wrote:
> Provide a function to get an additional pin on a page that we already have
> a pin on.  This will be used in fs/direct-io.c when dispatching multiple
> bios to a page we've extracted from a user-backed iter rather than redoing
> the extraction.
> 

I guess this function is only used for "replicating" an existing pin, 
and not changing the semantics of an existing pin: something that was 
pinned !FOLL_LONGTERM cannot suddenly become effectively pinned 
FOLL_LONGTERM.

Out of curiosity, could we end up passing in an anonymous page, or is 
this almost exclusively for pagecache+zero pages? (I rememebr John H. 
had a similar patch where he said it would not apply to anon pages)

> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: David Hildenbrand <david@redhat.com>
> cc: Lorenzo Stoakes <lstoakes@gmail.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Jan Kara <jack@suse.cz>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Jason Gunthorpe <jgg@nvidia.com>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: Hillf Danton <hdanton@sina.com>
> cc: Christian Brauner <brauner@kernel.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-kernel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
  2023-05-31  3:57   ` Christoph Hellwig
  2023-05-31  7:42   ` David Hildenbrand
@ 2023-05-31  8:20   ` David Howells
  2 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2023-05-31  8:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Christoph Hellwig, Lorenzo Stoakes, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton, John Hubbard

David Hildenbrand <david@redhat.com> wrote:

> > Provide a function to get an additional pin on a page that we already have
> > a pin on.  This will be used in fs/direct-io.c when dispatching multiple
> > bios to a page we've extracted from a user-backed iter rather than redoing
> > the extraction.
> > 
> 
> I guess this function is only used for "replicating" an existing pin, and not
> changing the semantics of an existing pin: something that was pinned
> !FOLL_LONGTERM cannot suddenly become effectively pinned FOLL_LONGTERM.
> 
> Out of curiosity, could we end up passing in an anonymous page, or is this
> almost exclusively for pagecache+zero pages? (I rememebr John H. had a similar
> patch where he said it would not apply to anon pages)

It has to be able to handle anything in a process's VM that you can
legitimately use as the buffer for a DIO read/write.  If we can get a pin on
it from pin_user_pages_fast(), then we need to be able to get an additional
pin on it.  For the moment it's just in the old DIO stuff, but it will almost
certainly be necessary in the networking code too to handle splicing into
network packets.

David


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
                     ` (2 preceding siblings ...)
  2023-05-31  7:34   ` David Hildenbrand
@ 2023-05-31  8:35   ` David Howells
  2023-05-31  8:46     ` David Hildenbrand
  2023-05-31 13:55     ` David Howells
  3 siblings, 2 replies; 17+ messages in thread
From: David Howells @ 2023-05-31  8:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Christoph Hellwig, Lorenzo Stoakes, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

David Hildenbrand <david@redhat.com> wrote:

> > Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> > to it from the page tables and make unpin_user_page*() correspondingly
> > ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> > zero page's refcount as we're only allowed ~2 million pins on it -
> > something that userspace can conceivably trigger.
> 
> 2 millions pins (FOLL_PIN, which increments the refcount by 1024) or 2 million
> references ?

Definitely pins.  It's tricky because we've been using "pinned" to mean held
by a refcount or held by a flag too.

2 million pins on the zero page is in the realms of possibility.  It only
takes 32768 64-page DIO writes.

> > @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
> >    *
> >    * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
> >    * see Documentation/core-api/pin_user_pages.rst for further details.
> > + *
> > + * Note that if a zero_page is amongst the returned pages, it will not have
> > + * pins in it and unpin_user_page() will not remove pins from it.
> >    */
> 
> "it will not have pins in it" sounds fairly weird to a non-native speaker.

Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
"FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
you use DIO and fork at the same time.

> "Note that the refcount of any zero_pages returned among the pinned pages will
> not be incremented, and unpin_user_page() will similarly not decrement it."

That's not really right (although it happens to be true), because we're
talking primarily about the pin counter, not the refcount - and they may be
separate.

David


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-31  8:35   ` David Howells
@ 2023-05-31  8:46     ` David Hildenbrand
  2023-05-31 13:55     ` David Howells
  1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-05-31  8:46 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Lorenzo Stoakes, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On 31.05.23 10:35, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>>> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
>>> to it from the page tables and make unpin_user_page*() correspondingly
>>> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
>>> zero page's refcount as we're only allowed ~2 million pins on it -
>>> something that userspace can conceivably trigger.
>>
>> 2 millions pins (FOLL_PIN, which increments the refcount by 1024) or 2 million
>> references ?
> 
> Definitely pins.  It's tricky because we've been using "pinned" to mean held
> by a refcount or held by a flag too.
> 

Yes, it would be clearer if we would be using "pinned" now only for 
FOLL_PIN and everything else is simply "taking a temporary reference on 
the page".

> 2 million pins on the zero page is in the realms of possibility.  It only
> takes 32768 64-page DIO writes.
> 
>>> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>>>     *
>>>     * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>>>     * see Documentation/core-api/pin_user_pages.rst for further details.
>>> + *
>>> + * Note that if a zero_page is amongst the returned pages, it will not have
>>> + * pins in it and unpin_user_page() will not remove pins from it.
>>>     */
>>
>> "it will not have pins in it" sounds fairly weird to a non-native speaker.
> 
> Oh, I know.  The problem is that "pin" is now really ambiguous.  Can we change
> "FOLL_PIN" to "FOLL_NAIL"?  Or maybe "FOLL_SCREW" - your pages are screwed if
> you use DIO and fork at the same time.
> 

I'm hoping that "pinning" will be "FOLL_PIN" (intention to access page 
content) and everything else is simply "taking a temporary page reference".

>> "Note that the refcount of any zero_pages returned among the pinned pages will
>> not be incremented, and unpin_user_page() will similarly not decrement it."
> 
> That's not really right (although it happens to be true), because we're
> talking primarily about the pin counter, not the refcount - and they may be
> separate.

In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If 
we have a separate pincount, we increment/decrement the refcount by 1 
when (un)pinning.

Sure, if we'd have a separate pincount we'd also not be modifying it.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-31  8:35   ` David Howells
  2023-05-31  8:46     ` David Hildenbrand
@ 2023-05-31 13:55     ` David Howells
  2023-05-31 14:10       ` Lorenzo Stoakes
  2023-06-01 10:14       ` David Hildenbrand
  1 sibling, 2 replies; 17+ messages in thread
From: David Howells @ 2023-05-31 13:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Christoph Hellwig, Lorenzo Stoakes, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

David Hildenbrand <david@redhat.com> wrote:

> Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN

You're not likely to get that.  "To pin" is too useful a verb that gets used
in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
name:-/.  I guess the English language has got somewhat overloaded.  Maybe
FOLL_PEG? ;-)

> and everything else is simply "taking a temporary reference on the page".

Excluding refs taken with pins, many refs are more permanent than pins as, so
far as I'm aware, pins only last for the duration of an I/O operation.

> >> "Note that the refcount of any zero_pages returned among the pinned pages will
> >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > That's not really right (although it happens to be true), because we're
> > talking primarily about the pin counter, not the refcount - and they may be
> > separate.
> 
> In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
> have a separate pincount, we increment/decrement the refcount by 1 when
> (un)pinning.

FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
ref if we count a pin, but that's kind of irrelevant; what matters is that the
effect must be undone with un-PUP.

It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
think we can do that yet.  Too many places assume that GUP will give them a
ref they can release later via ordinary methods.

David


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-31 13:55     ` David Howells
@ 2023-05-31 14:10       ` Lorenzo Stoakes
  2023-06-01 10:14       ` David Hildenbrand
  1 sibling, 0 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2023-05-31 14:10 UTC (permalink / raw)
  To: David Howells
  Cc: David Hildenbrand, Christoph Hellwig, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Wed, May 31, 2023 at 02:55:35PM +0100, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
>
> > Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN
>
> You're not likely to get that.  "To pin" is too useful a verb that gets used
> in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
> name:-/.  I guess the English language has got somewhat overloaded.  Maybe
> FOLL_PEG? ;-)
>

Might I suggest FOLL_FOLL? As we are essentially 'following' the page once
'pinned'. We could differentiate between what is currently FOLL_GET and
FOLL_PIN by using FOLL_FOLL and FOLL_FOLL_FOLL.

Patch series coming soon...

> > and everything else is simply "taking a temporary reference on the page".
>
> Excluding refs taken with pins, many refs are more permanent than pins as, so
> far as I'm aware, pins only last for the duration of an I/O operation.
>
> > >> "Note that the refcount of any zero_pages returned among the pinned pages will
> > >> not be incremented, and unpin_user_page() will similarly not decrement it."
> > > That's not really right (although it happens to be true), because we're
> > > talking primarily about the pin counter, not the refcount - and they may be
> > > separate.
> >
> > In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
> > have a separate pincount, we increment/decrement the refcount by 1 when
> > (un)pinning.
>
> FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
> ref if we count a pin, but that's kind of irrelevant; what matters is that the
> effect must be undone with un-PUP.
>
> It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
> think we can do that yet.  Too many places assume that GUP will give them a
> ref they can release later via ordinary methods.
>
> David
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning
  2023-05-26 21:41 [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
                   ` (2 preceding siblings ...)
  2023-05-26 21:41 ` [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
@ 2023-05-31 15:48 ` Jens Axboe
  3 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-05-31 15:48 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand, Lorenzo Stoakes, David Howells
  Cc: Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Jason Gunthorpe


On Fri, 26 May 2023 22:41:39 +0100, David Howells wrote:
> Here are three patches that go on top of the similar patches for bio
> structs now in the block tree that make the old block direct-IO code use
> iov_iter_extract_pages() and page pinning.
> 
> There are three patches:
> 
>  (1) Make page pinning neither add nor remove a pin to/from a ZERO_PAGE,
>      thereby allowing the dio code to insert zero pages in the middle of
>      dealing with pinned pages.  This also mitigates a potential problem
>      whereby userspace could force the overrun the pin counter of a zero
>      page.
> 
> [...]

Applied, thanks!

[1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
      commit: c8070b78751955e59b42457b974bea4a4fe00187
[2/3] mm: Provide a function to get an additional pin on a page
      commit: 1101fb8f89e5fc548c4d0ad66750e98980291815
[3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
      commit: 1ccf164ec866cb8575ab9b2e219fca875089c60e

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-31 13:55     ` David Howells
  2023-05-31 14:10       ` Lorenzo Stoakes
@ 2023-06-01 10:14       ` David Hildenbrand
  1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-06-01 10:14 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Lorenzo Stoakes, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On 31.05.23 15:55, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Yes, it would be clearer if we would be using "pinned" now only for FOLL_PIN
> 
> You're not likely to get that.  "To pin" is too useful a verb that gets used
> in other contexts too.  For that reason, I think FOLL_PIN was a poor choice of
> name:-/.  I guess the English language has got somewhat overloaded.  Maybe
> FOLL_PEG? ;-)

You're probably right. FOLL_PIN and all around that is "get me an 
additional reference on the page and make sure I can DMA it without any 
undesired side-effects".

FOLL_PIN_DMA would have been clearer (and matches 
folio_maybe_dma_pinned() ) ... but then, there are some use cases where 
want the same semantics but not actually perform DMA, but simply 
read/write via the directmap (e.g., vmsplice, some io_uring cases). 
Sure, one could say that they behave like DMA: access page content at 
any time.

Saying a page is pinned (additional refcount) and having a pincount of 0 
or does indeed cause confusion.

... but once we start renaming FOLL_PIN, pincount, ... we also have to 
rename pin_user_pages() and friends, and things get nasty.

> 
>> and everything else is simply "taking a temporary reference on the page".
> 
> Excluding refs taken with pins, many refs are more permanent than pins as, so
> far as I'm aware, pins only last for the duration of an I/O operation.

I was more thinking along the lines of FOLL_GET vs. FOLL_PIN. Once we 
consider any references we might have on a page, things get more tricky 
indeed.

> 
>>>> "Note that the refcount of any zero_pages returned among the pinned pages will
>>>> not be incremented, and unpin_user_page() will similarly not decrement it."
>>> That's not really right (although it happens to be true), because we're
>>> talking primarily about the pin counter, not the refcount - and they may be
>>> separate.
>>
>> In any case (FOLL_PIN/FOLL_GET) you increment/decrement the refcount. If we
>> have a separate pincount, we increment/decrement the refcount by 1 when
>> (un)pinning.
> 
> FOLL_GET isn't relevant here - only FOLL_PIN.  Yes, as it happens, we count a
> ref if we count a pin, but that's kind of irrelevant; what matters is that the
> effect must be undone with un-PUP.

The point I was trying to make is that we always modify the refcount, 
and in some cases (FOLL_PIN on order > 0) also the pincount.

But if you define "pins" as "additional reference", we're on the same page.

> 
> It would be nice not to get a ref on the zero page in FOLL_GET, but I don't
> think we can do that yet.  Too many places assume that GUP will give them a
> ref they can release later via ordinary methods.

No we can't I'm afraid.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-06-01 10:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-26 21:41 [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
2023-05-26 21:41 ` [PATCH v4 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
2023-05-27 19:46   ` Lorenzo Stoakes
2023-05-31  3:57   ` Christoph Hellwig
2023-05-31  7:34   ` David Hildenbrand
2023-05-31  8:35   ` David Howells
2023-05-31  8:46     ` David Hildenbrand
2023-05-31 13:55     ` David Howells
2023-05-31 14:10       ` Lorenzo Stoakes
2023-06-01 10:14       ` David Hildenbrand
2023-05-26 21:41 ` [PATCH v4 2/3] mm: Provide a function to get an additional pin on a page David Howells
2023-05-31  3:57   ` Christoph Hellwig
2023-05-31  7:42   ` David Hildenbrand
2023-05-31  8:20   ` David Howells
2023-05-26 21:41 ` [PATCH v4 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
2023-05-31  3:58   ` Christoph Hellwig
2023-05-31 15:48 ` [PATCH v4 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.