All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning
@ 2023-05-25 22:39 David Howells
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: David Howells @ 2023-05-25 22:39 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm

Hi Christoph, David,

Since Christoph asked nicely[1] ;-), here are three patches that go on top
of the similar patches for bio structs now in the block tree that make the
old block direct-IO code use iov_iter_extract_pages() and page pinning.

There are three patches:

 (1) Make page pinning not add or remove a pin to/from a ZERO_PAGE, thereby
     allowing the dio code to insert zero pages in the middle of dealing
     with pinned pages.

     A pair of functions are provided to wrap the testing of a page or
     folio to see if it is a zero page.

 (2) Provide a function to allow an additional pin to be taken on a page we
     already have pinned (and do nothing for a zero page).

 (3) Switch direct-io.c over to using page pinning and to use
     iov_iter_extract_pages() so that pages from non-user-backed iterators
     aren't pinned.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-old-dio

David

Changes
=======
ver #2)
 - Fix use of ZERO_PAGE().
 - Add wrappers for testing if a page is a zero page.
 - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
 - Need to set BIO_PAGE_PINNED conditionally, and not BIO_PAGE_REFFED.

Link: https://lore.kernel.org/r/ZGxfrOLZ4aN9/MvE@infradead.org/ [1]
Link: https://lore.kernel.org/r/20230525155102.87353-1-dhowells@redhat.com/ # v1

David Howells (3):
  mm: Don't pin ZERO_PAGE in pin_user_pages()
  mm: Provide a function to get an additional pin on a page
  block: Use iov_iter_extract_pages() and page pinning in direct-io.c

 fs/direct-io.c          | 72 ++++++++++++++++++++++++-----------------
 include/linux/mm.h      |  1 +
 include/linux/pgtable.h | 10 ++++++
 mm/gup.c                | 54 ++++++++++++++++++++++++++++++-
 4 files changed, 107 insertions(+), 30 deletions(-)


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-25 22:39 [RFC PATCH v2 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
@ 2023-05-25 22:39 ` David Howells
  2023-05-26  8:10   ` Lorenzo Stoakes
                     ` (3 more replies)
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
  2023-05-25 22:39 ` [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
  2 siblings, 4 replies; 20+ messages in thread
From: David Howells @ 2023-05-25 22:39 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #2)
     - Fix use of ZERO_PAGE().
     - Add is_zero_page() and is_zero_folio() wrappers.
     - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.

 include/linux/pgtable.h | 10 ++++++++++
 mm/gup.c                | 25 ++++++++++++++++++++++++-
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..2b0431a11de2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1245,6 +1245,16 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif /* CONFIG_MMU */
 
+static inline bool is_zero_page(const struct page *page)
+{
+	return is_zero_pfn(page_to_pfn(page));
+}
+
+static inline bool is_zero_folio(const struct folio *folio)
+{
+	return is_zero_page(&folio->page);
+}
+
 #ifdef CONFIG_MMU
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..69b002628f5d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 		struct page *page = *pages;
 		struct folio *folio = page_folio(page);
 
-		if (!folio_test_anon(folio))
+		if (is_zero_page(page) ||
+		    !folio_test_anon(folio))
 			continue;
 		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
@@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	else if (flags & FOLL_PIN) {
 		struct folio *folio;
 
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return page_folio(page);
+
 		/*
 		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
 		 * right zone, so fail and let the caller fall back to the slow
@@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		if (is_zero_folio(folio))
+			return;
 		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
 		if (folio_test_large(folio))
 			atomic_sub(refs, &folio->_pincount);
@@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
+		/*
+		 * Don't take a pin on the zero page - it's not going anywhere
+		 * and it is used in a *lot* of places.
+		 */
+		if (is_zero_page(page))
+			return 0;
+
 		/*
 		 * Similar to try_grab_folio(): be sure to *also*
 		 * increment the normal page refcount field at least once,
@@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
  *
  * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
  * see Documentation/core-api/pin_user_pages.rst for further details.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
@@ -3161,6 +3181,9 @@ EXPORT_SYMBOL(pin_user_pages);
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
  * FOLL_PIN and rejects FOLL_GET.
+ *
+ * Note that if the zero_page is amongst the returned pages, it will not have
+ * pins in it and unpin_user_page() will not remove pins from it.
  */
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 			     struct page **pages, unsigned int gup_flags)


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-25 22:39 [RFC PATCH v2 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
@ 2023-05-25 22:39 ` David Howells
  2023-05-26  2:29   ` Linus Torvalds
                     ` (3 more replies)
  2023-05-25 22:39 ` [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
  2 siblings, 4 replies; 20+ messages in thread
From: David Howells @ 2023-05-25 22:39 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Provide a function to get an additional pin on a page that we already have
a pin on.  This will be used in fs/direct-io.c when dispatching multiple
bios to a page we've extracted from a user-backed iter rather than redoing
the extraction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..931b75dae7ff 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2383,6 +2383,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+void page_get_additional_pin(struct page *page);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
diff --git a/mm/gup.c b/mm/gup.c
index 69b002628f5d..4b4353a184ed 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -275,6 +275,35 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+/**
+ * page_get_additional_pin - Try to get an additional pin on a pinned page
+ * @page: The page to be pinned
+ *
+ * Get an additional pin on a page we already have a pin on.  Makes no change
+ * if the page is the zero_page.
+ */
+void page_get_additional_pin(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	if (page == ZERO_PAGE(0))
+		return;
+
+	/*
+	 * Similar to try_grab_folio(): be sure to *also* increment the normal
+	 * page refcount field at least once, so that the page really is
+	 * pinned.
+	 */
+	if (folio_test_large(folio)) {
+		WARN_ON_ONCE(atomic_read(&folio->_pincount) < 1);
+		folio_ref_add(folio, 1);
+		atomic_add(1, &folio->_pincount);
+	} else {
+		WARN_ON_ONCE(folio_ref_count(folio) < GUP_PIN_COUNTING_BIAS);
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+}
+
 static inline struct folio *gup_folio_range_next(struct page *start,
 		unsigned long npages, unsigned long i, unsigned int *ntails)
 {


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
  2023-05-25 22:39 [RFC PATCH v2 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
@ 2023-05-25 22:39 ` David Howells
  2023-05-26  8:26   ` Christoph Hellwig
  2023-05-26  8:33   ` David Howells
  2 siblings, 2 replies; 20+ messages in thread
From: David Howells @ 2023-05-25 22:39 UTC (permalink / raw)
  To: Christoph Hellwig, David Hildenbrand
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

Change the old block-based direct-I/O code to use iov_iter_extract_pages()
to pin user pages or leave kernel pages unpinned rather than taking refs
when submitting bios.

This makes use of the preceding patches to not take pins on the zero page
(thereby allowing insertion of zero pages in with pinned pages) and to get
additional pins on pages, allowing an extracted page to be used in multiple
bios without having to re-extract it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #2)
     - Need to set BIO_PAGE_PINNED conditionally, not BIO_PAGE_REFFED.

 fs/direct-io.c | 72 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 29 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ad20f3428bab..5d4c5be0fb41 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -42,8 +42,8 @@
 #include "internal.h"
 
 /*
- * How many user pages to map in one call to get_user_pages().  This determines
- * the size of a structure in the slab cache
+ * How many user pages to map in one call to iov_iter_extract_pages().  This
+ * determines the size of a structure in the slab cache
  */
 #define DIO_PAGES	64
 
@@ -121,12 +121,13 @@ struct dio {
 	struct inode *inode;
 	loff_t i_size;			/* i_size when submitted */
 	dio_iodone_t *end_io;		/* IO completion function */
+	bool need_unpin;		/* T if we need to unpin the pages */
 
 	void *private;			/* copy from map_bh.b_private */
 
 	/* BIO completion state */
 	spinlock_t bio_lock;		/* protects BIO fields below */
-	int page_errors;		/* errno from get_user_pages() */
+	int page_errors;		/* err from iov_iter_extract_pages() */
 	int is_async;			/* is IO async ? */
 	bool defer_completion;		/* defer AIO completion to workqueue? */
 	bool should_dirty;		/* if pages should be dirtied */
@@ -165,14 +166,14 @@ static inline unsigned dio_pages_present(struct dio_submit *sdio)
  */
 static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
+	struct page **pages = dio->pages;
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX,
+				     DIO_PAGES, 0, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
-		struct page *page = ZERO_PAGE(0);
 		/*
 		 * A memory fault, but the filesystem has some outstanding
 		 * mapped blocks.  We need to use those blocks up to avoid
@@ -180,8 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
-		dio->pages[0] = page;
+		dio->pages[0] = ZERO_PAGE(0);
 		sdio->head = 0;
 		sdio->tail = 1;
 		sdio->from = 0;
@@ -201,9 +201,9 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 
 /*
  * Get another userspace page.  Returns an ERR_PTR on error.  Pages are
- * buffered inside the dio so that we can call get_user_pages() against a
- * decent number of pages, less frequently.  To provide nicer use of the
- * L1 cache.
+ * buffered inside the dio so that we can call iov_iter_extract_pages()
+ * against a decent number of pages, less frequently.  To provide nicer use of
+ * the L1 cache.
  */
 static inline struct page *dio_get_page(struct dio *dio,
 					struct dio_submit *sdio)
@@ -219,6 +219,18 @@ static inline struct page *dio_get_page(struct dio *dio,
 	return dio->pages[sdio->head];
 }
 
+static void dio_pin_page(struct dio *dio, struct page *page)
+{
+	if (dio->need_unpin)
+		page_get_additional_pin(page);
+}
+
+static void dio_unpin_page(struct dio *dio, struct page *page)
+{
+	if (dio->need_unpin)
+		unpin_user_page(page);
+}
+
 /*
  * dio_complete() - called when all DIO BIO I/O has been completed
  *
@@ -402,8 +414,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_aio;
 	else
 		bio->bi_end_io = dio_bio_end_io;
-	/* for now require references for all pages */
-	bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (dio->need_unpin)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
@@ -444,8 +456,9 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
  */
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
-	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+	if (dio->need_unpin)
+		unpin_user_pages(dio->pages + sdio->head,
+				 sdio->tail - sdio->head);
 }
 
 /*
@@ -676,7 +689,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
  *
  * Return zero on success.  Non-zero means the caller needs to start a new BIO.
  */
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
 {
 	int ret;
 
@@ -688,7 +701,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		dio_pin_page(dio, sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -743,11 +756,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 			goto out;
 	}
 
-	if (dio_bio_add_page(sdio) != 0) {
+	if (dio_bio_add_page(dio, sdio) != 0) {
 		dio_bio_submit(dio, sdio);
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
-			ret = dio_bio_add_page(sdio);
+			ret = dio_bio_add_page(dio, sdio);
 			BUG_ON(ret != 0);
 		}
 	}
@@ -804,13 +817,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_unpin_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	dio_pin_page(dio, page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -825,7 +838,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_unpin_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -926,7 +939,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_unpin_page(dio, page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -971,7 +984,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_unpin_page(dio, page);
 					return -ENOTBLK;
 				}
 
@@ -984,7 +997,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_unpin_page(dio, page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1024,7 +1037,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_unpin_page(dio, page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1039,8 +1052,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				break;
 		}
 
-		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		/* Drop the pin which was taken in get_user_pages() */
+		dio_unpin_page(dio, page);
 	}
 out:
 	return ret;
@@ -1135,6 +1148,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		/* will be released by direct_io_worker */
 		inode_lock(inode);
 	}
+	dio->need_unpin = iov_iter_extract_will_pin(iter);
 
 	/* Once we sampled i_size check for reads beyond EOF */
 	dio->i_size = i_size_read(inode);
@@ -1259,7 +1273,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_unpin_page(dio, sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
@ 2023-05-26  2:29   ` Linus Torvalds
  2023-05-26  5:50   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2023-05-26  2:29 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Andrew Morton

On Thu, May 25, 2023 at 3:40 PM David Howells <dhowells@redhat.com> wrote:
>
> +void page_get_additional_pin(struct page *page)
> +{
> +       struct folio *folio = page_folio(page);
> +
> +       if (page == ZERO_PAGE(0))
> +               return;

You added that nice "is_zero_folio()", and then you did the above anyway..

               Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
  2023-05-26  2:29   ` Linus Torvalds
@ 2023-05-26  5:50   ` David Howells
  2023-05-26  8:20   ` Christoph Hellwig
  2023-05-26  8:31   ` David Howells
  3 siblings, 0 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  5:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Christoph Hellwig, David Hildenbrand, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Andrew Morton

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > +       if (page == ZERO_PAGE(0))
> > +               return;
> 
> You added that nice "is_zero_folio()", and then you did the above anyway..

Bah.  Missed it because it was in a different patch.

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
@ 2023-05-26  8:10   ` Lorenzo Stoakes
  2023-05-26  8:17     ` Christoph Hellwig
  2023-05-26  8:22   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-26  8:10 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Thu, May 25, 2023 at 11:39:51PM +0100, David Howells wrote:
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.

I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
time?

>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@infradead.org>
> cc: David Hildenbrand <david@redhat.com>
> cc: Andrew Morton <akpm@linux-foundation.org>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Jan Kara <jack@suse.cz>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Jason Gunthorpe <jgg@nvidia.com>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: Hillf Danton <hdanton@sina.com>
> cc: Christian Brauner <brauner@kernel.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-kernel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>
> Notes:
>     ver #2)
>      - Fix use of ZERO_PAGE().
>      - Add is_zero_page() and is_zero_folio() wrappers.
>      - Return the zero page obtained, not ZERO_PAGE(0) unconditionally.
>
>  include/linux/pgtable.h | 10 ++++++++++
>  mm/gup.c                | 25 ++++++++++++++++++++++++-
>  2 files changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5a51481bbb9..2b0431a11de2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1245,6 +1245,16 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
>  }
>  #endif /* CONFIG_MMU */
>
> +static inline bool is_zero_page(const struct page *page)
> +{
> +	return is_zero_pfn(page_to_pfn(page));
> +}
> +
> +static inline bool is_zero_folio(const struct folio *folio)
> +{
> +	return is_zero_page(&folio->page);
> +}
> +
>  #ifdef CONFIG_MMU
>
>  #ifndef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..69b002628f5d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -51,7 +51,8 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>  		struct page *page = *pages;
>  		struct folio *folio = page_folio(page);
>
> -		if (!folio_test_anon(folio))
> +		if (is_zero_page(page) ||
> +		    !folio_test_anon(folio))
>  			continue;
>  		if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>  			VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> @@ -131,6 +132,13 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  	else if (flags & FOLL_PIN) {
>  		struct folio *folio;
>
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return page_folio(page);
> +

This will capture huge page cases too which have folio->_pincount and thus
don't suffer the GUP_PIN_COUNTING_BIAS issue, however it is equally logical
to simply skip these when pinning.

This does make me think that we should just skip pinning for FOLL_GET cases
too - there's literally no sane reason we should be pinning zero pages in
any case (unless I'm missing something!)

>  		/*
>  		 * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a
>  		 * right zone, so fail and let the caller fall back to the slow
> @@ -180,6 +188,8 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>  static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		if (is_zero_folio(folio))
> +			return;
>  		node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>  		if (folio_test_large(folio))
>  			atomic_sub(refs, &folio->_pincount);
> @@ -224,6 +234,13 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
>  	if (flags & FOLL_GET)
>  		folio_ref_inc(folio);
>  	else if (flags & FOLL_PIN) {
> +		/*
> +		 * Don't take a pin on the zero page - it's not going anywhere
> +		 * and it is used in a *lot* of places.
> +		 */
> +		if (is_zero_page(page))
> +			return 0;
> +
>  		/*
>  		 * Similar to try_grab_folio(): be sure to *also*
>  		 * increment the normal page refcount field at least once,
> @@ -3079,6 +3096,9 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
>   *
>   * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
>   * see Documentation/core-api/pin_user_pages.rst for further details.
> + *
> + * Note that if the zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */
>  int pin_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages)
> @@ -3161,6 +3181,9 @@ EXPORT_SYMBOL(pin_user_pages);
>   * pin_user_pages_unlocked() is the FOLL_PIN variant of
>   * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>   * FOLL_PIN and rejects FOLL_GET.
> + *
> + * Note that if the zero_page is amongst the returned pages, it will not have
> + * pins in it and unpin_user_page() will not remove pins from it.
>   */
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  			     struct page **pages, unsigned int gup_flags)
>

Shouldn't this comment be added to pup() and pup_remote() also? Also I
think it's well worth updating Documentation/core-api/pin_user_pages.rst to
reflect this as that is explclitly referenced by a few comments and it's a
worthwhile corner case to cover.

Another nitty thing that I noticed is, in is_longterm_pinnable_page():-

	/* The zero page may always be pinned */
	if (is_zero_pfn(page_to_pfn(page)))
		return true;

Which, strictly speaking I suppose we are 'pinning' it or rather allowing
the pin to succeed without actually pinning, but to be super pedantic
perhaps it's worth updating this comment too.

Other than the pedantic nits, this patch looks good and makes a lot of
sense so:-

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  8:10   ` Lorenzo Stoakes
@ 2023-05-26  8:17     ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Howells, Christoph Hellwig, David Hildenbrand, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Fri, May 26, 2023 at 09:10:33AM +0100, Lorenzo Stoakes wrote:
> On Thu, May 25, 2023 at 11:39:51PM +0100, David Howells wrote:
> > Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> > to it from the page tables and make unpin_user_page*() correspondingly
> > ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
> > zero page's refcount as we're only allowed ~2 million pins on it -
> > something that userspace can conceivably trigger.
> 
> I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
> be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
> time?

I think FOLL_GET would be just as useful.  But given that we have
a few places that release pins while gets just do a put_page it would
be a lot more effort to audit all of them.  Maybe it's better do only
do this once we've converted all the places that should do pin and
have very few FOLL_GET users left.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
  2023-05-26  2:29   ` Linus Torvalds
  2023-05-26  5:50   ` David Howells
@ 2023-05-26  8:20   ` Christoph Hellwig
  2023-05-26  8:31   ` David Howells
  3 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:20 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Thu, May 25, 2023 at 11:39:52PM +0100, David Howells wrote:
> +/**
> + * page_get_additional_pin - Try to get an additional pin on a pinned page
> + * @page: The page to be pinned
> + *
> + * Get an additional pin on a page we already have a pin on.  Makes no change
> + * if the page is the zero_page.
> + */
> +void page_get_additional_pin(struct page *page)

page_get_additional_pin seems like an odd name, mixing the get and
pin terminologies.  What about repin_page?  Or move to a folio interface
from the start can call it folio_repin?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-26  8:10   ` Lorenzo Stoakes
@ 2023-05-26  8:22   ` Christoph Hellwig
  2023-05-26  8:27     ` Lorenzo Stoakes
  2023-05-26  8:29   ` David Howells
  2023-05-26  8:43   ` David Howells
  3 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:22 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

Shouldn't unpin_user_pages and bio_release_page also be updated to
skip a zero page here?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
  2023-05-25 22:39 ` [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
@ 2023-05-26  8:26   ` Christoph Hellwig
  2023-05-26  8:33   ` David Howells
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:26 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Thu, May 25, 2023 at 11:39:53PM +0100, David Howells wrote:
> Change the old block-based direct-I/O code to use iov_iter_extract_pages()
> to pin user pages or leave kernel pages unpinned rather than taking refs
> when submitting bios.
> 
> This makes use of the preceding patches to not take pins on the zero page
> (thereby allowing insertion of zero pages in with pinned pages) and to get
> additional pins on pages, allowing an extracted page to be used in multiple
> bios without having to re-extract it.

I'm not seeing where we skip the unpin of the zero page, as commented
in patch 1 (but maybe I'm not reviewing carefully enough as I'm at a
conference right now).

Otherwise my only rather cosmetic comment right now is that I'd called
the "need_unpin" member is_pinned.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  8:22   ` Christoph Hellwig
@ 2023-05-26  8:27     ` Lorenzo Stoakes
  0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-26  8:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Fri, May 26, 2023 at 01:22:54AM -0700, Christoph Hellwig wrote:
> Shouldn't unpin_user_pages and bio_release_page also be updated to
> skip a zero page here?
>

unpin_user_pages*() all call gup_put_folio() which already skips the zero page
so we should be covered on that front.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
  2023-05-26  8:10   ` Lorenzo Stoakes
  2023-05-26  8:22   ` Christoph Hellwig
@ 2023-05-26  8:29   ` David Howells
  2023-05-26  8:43   ` David Howells
  3 siblings, 0 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  8:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, David Hildenbrand, Jens Axboe, Al Viro, Matthew Wilcox,
	Jan Kara, Jeff Layton, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Andrew Morton

Christoph Hellwig <hch@infradead.org> wrote:

> Shouldn't unpin_user_pages

It calls gup_put_folio() which is where the skip is.

> and bio_release_page also be updated to skip a zero page here?

Porbably best to leave it to unpin_user_page() there.  I've tried to hide the
behaviour entirely within gup.c.

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page
  2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
                     ` (2 preceding siblings ...)
  2023-05-26  8:20   ` Christoph Hellwig
@ 2023-05-26  8:31   ` David Howells
  3 siblings, 0 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  8:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, David Hildenbrand, Jens Axboe, Al Viro, Matthew Wilcox,
	Jan Kara, Jeff Layton, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Andrew Morton

Christoph Hellwig <hch@infradead.org> wrote:

> > +void page_get_additional_pin(struct page *page)
> 
> page_get_additional_pin seems like an odd name, mixing the get and
> pin terminologies.  What about repin_page?

I considered that, though repin_page() suggests putting a pin back in after
one is removed, but I can go with that if no one objects.

> Or move to a folio interface from the start can call it folio_repin?

I also considered this, but the entire gup interface is page-based at the
moment, but I can do that too:-/

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c
  2023-05-25 22:39 ` [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
  2023-05-26  8:26   ` Christoph Hellwig
@ 2023-05-26  8:33   ` David Howells
  1 sibling, 0 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, David Hildenbrand, Jens Axboe, Al Viro, Matthew Wilcox,
	Jan Kara, Jeff Layton, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Andrew Morton

Christoph Hellwig <hch@infradead.org> wrote:

> I'm not seeing where we skip the unpin of the zero page, as commented
> in patch 1 (but maybe I'm not reviewing carefully enough as I'm at a
> conference right now).

It's done by unpin_user_page{,s}(), hidden away in gup.c.  See the commit
message for patch 1:

	Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a
	pointer to it from the page tables and make unpin_user_page*()
	correspondingly ignore a ZERO_PAGE when unpinning.

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
                     ` (2 preceding siblings ...)
  2023-05-26  8:29   ` David Howells
@ 2023-05-26  8:43   ` David Howells
  2023-05-26  8:58     ` Lorenzo Stoakes
  2023-05-26  9:15     ` David Howells
  3 siblings, 2 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  8:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: dhowells, Christoph Hellwig, David Hildenbrand, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
> be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
> time?

It's not that - it's that iov_iter_get_pages*() is a lot more commonly used at
the moment, and we'd have to find *all* the places that things using that hand
refs around.

iov_iter_extract_pages(), on the other hand, is only used in two places with
these patches and the pins are always released with unpin_user_page*() so it's
a lot easier to audit.

I could modify put_page(), folio_put(), etc. to ignore the zero pages, but
that might have a larger performance impact.

> > +		if (is_zero_page(page))
> > +			return page_folio(page);
> > +
> 
> This will capture huge page cases too which have folio->_pincount and thus
> don't suffer the GUP_PIN_COUNTING_BIAS issue, however it is equally logical
> to simply skip these when pinning.

I'm not sure I understand.  The zero page(s) is/are single-page folios?

> This does make me think that we should just skip pinning for FOLL_GET cases
> too - there's literally no sane reason we should be pinning zero pages in
> any case (unless I'm missing something!)

As mentioned above, there's a code auditing issue and a potential performance
issue, depending on how it's done.

> Another nitty thing that I noticed is, in is_longterm_pinnable_page():-
> 
> 	/* The zero page may always be pinned */
> 	if (is_zero_pfn(page_to_pfn(page)))
> 		return true;
> 
> Which, strictly speaking I suppose we are 'pinning' it or rather allowing
> the pin to succeed without actually pinning, but to be super pedantic
> perhaps it's worth updating this comment too.

Yeah.  It is "pinnable" but no pin will actually be added.

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  8:43   ` David Howells
@ 2023-05-26  8:58     ` Lorenzo Stoakes
  2023-05-26  9:15     ` David Howells
  1 sibling, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-26  8:58 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Fri, May 26, 2023 at 09:43:35AM +0100, David Howells wrote:
> Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> > I guess we're not quite as concerned about FOLL_GET because FOLL_GET should
> > be ephemeral and FOLL_PIN (horrifically) adds GUP_PIN_COUNTING_BIAS each
> > time?
>
> It's not that - it's that iov_iter_get_pages*() is a lot more commonly used at
> the moment, and we'd have to find *all* the places that things using that hand
> refs around.
>
> iov_iter_extract_pages(), on the other hand, is only used in two places with
> these patches and the pins are always released with unpin_user_page*() so it's
> a lot easier to audit.

Thanks for the clarification. I guess these are the cases where you're likely to
see zero page usage, but since this is changing all PUP*() callers don't you
need to audit all of those too?

>
> I could modify put_page(), folio_put(), etc. to ignore the zero pages, but
> that might have a larger performance impact.
>
> > > +		if (is_zero_page(page))
> > > +			return page_folio(page);
> > > +
> >
> > This will capture huge page cases too which have folio->_pincount and thus
> > don't suffer the GUP_PIN_COUNTING_BIAS issue, however it is equally logical
> > to simply skip these when pinning.
>
> I'm not sure I understand.  The zero page(s) is/are single-page folios?

I'm actually a little unsure of how huge zero pages are handled (not an area I'm
hugely familiar with) so this might just be mistaken, I mean the point was more
so that hugetlb calls into this, but seems then not an issue.

>
> > This does make me think that we should just skip pinning for FOLL_GET cases
> > too - there's literally no sane reason we should be pinning zero pages in
> > any case (unless I'm missing something!)
>
> As mentioned above, there's a code auditing issue and a potential performance
> issue, depending on how it's done.

Ack, makes sense. It'd be good to have this documented somewhere though in
commit msg or docs so this trade-off is clear.

>
> > Another nitty thing that I noticed is, in is_longterm_pinnable_page():-
> >
> > 	/* The zero page may always be pinned */
> > 	if (is_zero_pfn(page_to_pfn(page)))
> > 		return true;
> >
> > Which, strictly speaking I suppose we are 'pinning' it or rather allowing
> > the pin to succeed without actually pinning, but to be super pedantic
> > perhaps it's worth updating this comment too.
>
> Yeah.  It is "pinnable" but no pin will actually be added.

The comment striks me as misleading, previously it literally meant that you
could pin the zero page. Now it means that we just don't. I do think for the
sake of avoiding confusion this should be tweaked.

Obviously something of a nit, however!

I did dig into this change a fair bit and kept adding then deleting comments
since you cover all the bases well, so overall this is nice + I can but nit it
:) Nice to see further improvements to GUP which is crying out for that.

>
> David
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  8:43   ` David Howells
  2023-05-26  8:58     ` Lorenzo Stoakes
@ 2023-05-26  9:15     ` David Howells
  2023-05-26  9:23       ` Lorenzo Stoakes
  2023-05-26  9:24       ` David Hildenbrand
  1 sibling, 2 replies; 20+ messages in thread
From: David Howells @ 2023-05-26  9:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: dhowells, Christoph Hellwig, David Hildenbrand, Jens Axboe,
	Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> > iov_iter_extract_pages(), on the other hand, is only used in two places
> > with these patches and the pins are always released with
> > unpin_user_page*() so it's a lot easier to audit.
> 
> Thanks for the clarification. I guess these are the cases where you're
> likely to see zero page usage, but since this is changing all PUP*() callers
> don't you need to audit all of those too?

I don't think it should be necessary.  This only affects pages obtained from
gup with FOLL_PIN - and, so far as I know, those always have to be released
with unpin_user_page*() which is part of the gup API and thus it should be
transparent to the users.

Pages obtained FOLL_GET, on the other hand, aren't freed through the gup API -
and there are a bunch of ways of releasing them - and getting additional refs
too.

David


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  9:15     ` David Howells
@ 2023-05-26  9:23       ` Lorenzo Stoakes
  2023-05-26  9:24       ` David Hildenbrand
  1 sibling, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2023-05-26  9:23 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, David Hildenbrand, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Andrew Morton

On Fri, May 26, 2023 at 10:15:26AM +0100, David Howells wrote:
> Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> > > iov_iter_extract_pages(), on the other hand, is only used in two places
> > > with these patches and the pins are always released with
> > > unpin_user_page*() so it's a lot easier to audit.
> >
> > Thanks for the clarification. I guess these are the cases where you're
> > likely to see zero page usage, but since this is changing all PUP*() callers
> > don't you need to audit all of those too?
>
> I don't think it should be necessary.  This only affects pages obtained from
> gup with FOLL_PIN - and, so far as I know, those always have to be released
> with unpin_user_page*() which is part of the gup API and thus it should be
> transparent to the users.
>

Right, I was only saying so in relation to you stating the need to audit,
for precisely this reason I wondered why you felt the need to :)

> Pages obtained FOLL_GET, on the other hand, aren't freed through the gup API -
> and there are a bunch of ways of releasing them - and getting additional refs
> too.

Yes that's a very good point! Sorry, in my enthusiasm for GUP reform this
thorny aspect slipped my mind...

As Christoph said though hopefully over time we can limit the use of FOLL_GET so
this becomes easier perhaps. Larger discussion on this area in [0] :)

[0]:https://lore.kernel.org/all/ZGWnq%2FdAYELyKpTy@infradead.org/

>
> David
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages()
  2023-05-26  9:15     ` David Howells
  2023-05-26  9:23       ` Lorenzo Stoakes
@ 2023-05-26  9:24       ` David Hildenbrand
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2023-05-26  9:24 UTC (permalink / raw)
  To: David Howells, Lorenzo Stoakes
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Andrew Morton

On 26.05.23 11:15, David Howells wrote:
> Lorenzo Stoakes <lstoakes@gmail.com> wrote:
> 
>>> iov_iter_extract_pages(), on the other hand, is only used in two places
>>> with these patches and the pins are always released with
>>> unpin_user_page*() so it's a lot easier to audit.
>>
>> Thanks for the clarification. I guess these are the cases where you're
>> likely to see zero page usage, but since this is changing all PUP*() callers
>> don't you need to audit all of those too?
> 
> I don't think it should be necessary.  This only affects pages obtained from
> gup with FOLL_PIN - and, so far as I know, those always have to be released
> with unpin_user_page*() which is part of the gup API and thus it should be
> transparent to the users.

Right, and even code like like 873aefb376bb ("vfio/type1: Unpin zero 
pages") would handle it transparently, because they also call 
unpin_user_page().

[we can remove 873aefb376bb even without this change way because it uses 
FOLL_LONGTERM that shouldn't return the shared zeropage anymore ]

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-05-26  9:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-25 22:39 [RFC PATCH v2 0/3] block: Make old dio use iov_iter_extract_pages() and page pinning David Howells
2023-05-25 22:39 ` [RFC PATCH v2 1/3] mm: Don't pin ZERO_PAGE in pin_user_pages() David Howells
2023-05-26  8:10   ` Lorenzo Stoakes
2023-05-26  8:17     ` Christoph Hellwig
2023-05-26  8:22   ` Christoph Hellwig
2023-05-26  8:27     ` Lorenzo Stoakes
2023-05-26  8:29   ` David Howells
2023-05-26  8:43   ` David Howells
2023-05-26  8:58     ` Lorenzo Stoakes
2023-05-26  9:15     ` David Howells
2023-05-26  9:23       ` Lorenzo Stoakes
2023-05-26  9:24       ` David Hildenbrand
2023-05-25 22:39 ` [RFC PATCH v2 2/3] mm: Provide a function to get an additional pin on a page David Howells
2023-05-26  2:29   ` Linus Torvalds
2023-05-26  5:50   ` David Howells
2023-05-26  8:20   ` Christoph Hellwig
2023-05-26  8:31   ` David Howells
2023-05-25 22:39 ` [RFC PATCH v2 3/3] block: Use iov_iter_extract_pages() and page pinning in direct-io.c David Howells
2023-05-26  8:26   ` Christoph Hellwig
2023-05-26  8:33   ` David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.