* [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:07 ` Jan Kara
2023-05-23 12:35 ` Christian Brauner
2023-05-22 20:57 ` [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it David Howells
` (7 subsequent siblings)
8 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, John Hubbard, Dave Chinner,
Christoph Hellwig
ZERO_PAGE can't go away, no need to hold an extra reference.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
---
fs/iomap/direct-io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 019cc87d0fb3..66a9f10e3207 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -203,7 +203,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- get_page(page);
+ bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing
2023-05-22 20:57 ` [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
@ 2023-05-23 8:07 ` Jan Kara
2023-05-23 12:35 ` Christian Brauner
1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:07 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, John Hubbard, Dave Chinner,
Christoph Hellwig
On Mon 22-05-23 21:57:39, David Howells wrote:
> ZERO_PAGE can't go away, no need to hold an extra reference.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: linux-fsdevel@vger.kernel.org
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/iomap/direct-io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 019cc87d0fb3..66a9f10e3207 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -203,7 +203,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - get_page(page);
> + bio_set_flag(bio, BIO_NO_PAGE_REF);
> __bio_add_page(bio, page, len, 0);
> iomap_dio_submit_bio(iter, dio, bio, pos);
> }
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing
2023-05-22 20:57 ` [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
2023-05-23 8:07 ` Jan Kara
@ 2023-05-23 12:35 ` Christian Brauner
1 sibling, 0 replies; 34+ messages in thread
From: Christian Brauner @ 2023-05-23 12:35 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Linus Torvalds, linux-fsdevel, linux-block,
linux-kernel, linux-mm, John Hubbard, Dave Chinner,
Christoph Hellwig
On Mon, May 22, 2023 at 09:57:39PM +0100, David Howells wrote:
> ZERO_PAGE can't go away, no need to hold an extra reference.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: linux-fsdevel@vger.kernel.org
> ---
Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
2023-05-22 20:57 ` [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:07 ` Jan Kara
2023-05-23 12:37 ` Christian Brauner
2023-05-22 20:57 ` [PATCH v21 3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
` (6 subsequent siblings)
8 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
Fix bio_flagged() so that multiple instances of it, such as:
if (bio_flagged(bio, BIO_PAGE_REFFED) ||
bio_flagged(bio, BIO_PAGE_PINNED))
can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).
The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero. This results in an out-of-line bio_release_page()
being compiled to something like:
<+0>: mov 0x14(%rdi),%eax
<+3>: test $0x1,%al
<+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
<+7>: test $0x2,%al
<+9>: je 0xffffffff816dac5c <bio_release_pages+20>
<+11>: movzbl %sil,%esi
<+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
<+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>
However, the test is superfluous as the return type is bool. Removing it
results in:
<+0>: testb $0x3,0x14(%rdi)
<+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
<+6>: movzbl %sil,%esi
<+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
<+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>
instead.
Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
include/linux/bio.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b3e7529ff55e..7f53be035cf0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -229,7 +229,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
static inline bool bio_flagged(struct bio *bio, unsigned int bit)
{
- return (bio->bi_flags & (1U << bit)) != 0;
+ return bio->bi_flags & (1U << bit);
}
static inline void bio_set_flag(struct bio *bio, unsigned int bit)
* Re: [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it
2023-05-22 20:57 ` [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-05-23 8:07 ` Jan Kara
2023-05-23 12:37 ` Christian Brauner
1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:07 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
On Mon 22-05-23 21:57:40, David Howells wrote:
> Fix bio_flagged() so that multiple instances of it, such as:
>
> if (bio_flagged(bio, BIO_PAGE_REFFED) ||
> bio_flagged(bio, BIO_PAGE_PINNED))
>
> can be combined by the gcc optimiser into a single test in assembly
> (arguably, this is a compiler optimisation issue[1]).
>
> The missed optimisation stems from bio_flagged() comparing the result of
> the bitwise-AND to zero. This results in an out-of-line bio_release_page()
> being compiled to something like:
>
> <+0>: mov 0x14(%rdi),%eax
> <+3>: test $0x1,%al
> <+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
> <+7>: test $0x2,%al
> <+9>: je 0xffffffff816dac5c <bio_release_pages+20>
> <+11>: movzbl %sil,%esi
> <+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
> <+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>
>
> However, the test is superfluous as the return type is bool. Removing it
> results in:
>
> <+0>: testb $0x3,0x14(%rdi)
> <+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
> <+6>: movzbl %sil,%esi
> <+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
> <+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>
>
> instead.
>
> Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
> 're-booling' the mark_dirty parameter.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: linux-block@vger.kernel.org
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
> Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
Sure. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/bio.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b3e7529ff55e..7f53be035cf0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -229,7 +229,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
>
> static inline bool bio_flagged(struct bio *bio, unsigned int bit)
> {
> - return (bio->bi_flags & (1U << bit)) != 0;
> + return bio->bi_flags & (1U << bit);
> }
>
> static inline void bio_set_flag(struct bio *bio, unsigned int bit)
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it
2023-05-22 20:57 ` [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-05-23 8:07 ` Jan Kara
@ 2023-05-23 12:37 ` Christian Brauner
1 sibling, 0 replies; 34+ messages in thread
From: Christian Brauner @ 2023-05-23 12:37 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Linus Torvalds, linux-fsdevel, linux-block,
linux-kernel, linux-mm, Christoph Hellwig, John Hubbard
On Mon, May 22, 2023 at 09:57:40PM +0100, David Howells wrote:
> Fix bio_flagged() so that multiple instances of it, such as:
>
> if (bio_flagged(bio, BIO_PAGE_REFFED) ||
> bio_flagged(bio, BIO_PAGE_PINNED))
>
> can be combined by the gcc optimiser into a single test in assembly
> (arguably, this is a compiler optimisation issue[1]).
>
> The missed optimisation stems from bio_flagged() comparing the result of
> the bitwise-AND to zero. This results in an out-of-line bio_release_page()
> being compiled to something like:
>
> <+0>: mov 0x14(%rdi),%eax
> <+3>: test $0x1,%al
> <+5>: jne 0xffffffff816dac53 <bio_release_pages+11>
> <+7>: test $0x2,%al
> <+9>: je 0xffffffff816dac5c <bio_release_pages+20>
> <+11>: movzbl %sil,%esi
> <+15>: jmp 0xffffffff816daba1 <__bio_release_pages>
> <+20>: jmp 0xffffffff81d0b800 <__x86_return_thunk>
>
> However, the test is superfluous as the return type is bool. Removing it
> results in:
>
> <+0>: testb $0x3,0x14(%rdi)
> <+4>: je 0xffffffff816e4af4 <bio_release_pages+15>
> <+6>: movzbl %sil,%esi
> <+10>: jmp 0xffffffff816dab7c <__bio_release_pages>
> <+15>: jmp 0xffffffff81d0b7c0 <__x86_return_thunk>
>
> instead.
>
> Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
> 're-booling' the mark_dirty parameter.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: linux-block@vger.kernel.org
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
> Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
> Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
> ---
Reviewed-by: Christian Brauner <brauner@kernel.org>
* [PATCH v21 3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
2023-05-22 20:57 ` [PATCH v21 1/6] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
2023-05-22 20:57 ` [PATCH v21 2/6] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:07 ` Jan Kara
2023-05-22 20:57 ` [PATCH v21 4/6] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
` (5 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
From: Christoph Hellwig <hch@lst.de>
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning: it is only set when a page reference has been acquired that needs
to be released by bio_release_pages().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---
Notes:
ver #8)
- Split out from another patch [hch].
- Don't default to BIO_PAGE_REFFED [hch].
ver #5)
- Split from patch that uses iov_iter_extract_pages().
block/bio.c | 2 +-
block/blk-map.c | 1 +
fs/direct-io.c | 2 ++
fs/iomap/direct-io.c | 1 -
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 2 +-
6 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 043944fd46eb..8516adeaea26 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1191,7 +1191,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
bio->bi_io_vec = (struct bio_vec *)iter->bvec;
bio->bi_iter.bi_bvec_done = iter->iov_offset;
bio->bi_iter.bi_size = size;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
bio_set_flag(bio, BIO_CLONED);
}
@@ -1336,6 +1335,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
+ bio_set_flag(bio, BIO_PAGE_REFFED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index 04c55f1c492e..33d9f6e89ba6 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extraction_flags |= ITER_ALLOW_P2PDMA;
+ bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
struct page **pages, *stack_pages[UIO_FASTIOV];
ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0b380bb8a81e..ad20f3428bab 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -402,6 +402,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
bio->bi_end_io = dio_bio_end_aio;
else
bio->bi_end_io = dio_bio_end_io;
+ /* for now require references for all pages */
+ bio_set_flag(bio, BIO_PAGE_REFFED);
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
}
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 66a9f10e3207..08873f0627dd 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -203,7 +203,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
bio->bi_private = dio;
bio->bi_end_io = iomap_dio_bio_end_io;
- bio_set_flag(bio, BIO_NO_PAGE_REF);
__bio_add_page(bio, page, len, 0);
iomap_dio_submit_bio(iter, dio, bio, pos);
}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7f53be035cf0..0922729acd26 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -488,7 +488,7 @@ void zero_fill_bio(struct bio *bio);
static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+ if (bio_flagged(bio, BIO_PAGE_REFFED))
__bio_release_pages(bio, mark_dirty);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 740afe80f297..dfd2c2cb909d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,7 +323,7 @@ struct bio {
* bio flags
*/
enum {
- BIO_NO_PAGE_REF, /* don't put release vec pages */
+ BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
BIO_QUIET, /* Make BIO Quiet */
* Re: [PATCH v21 3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
2023-05-22 20:57 ` [PATCH v21 3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-05-23 8:07 ` Jan Kara
0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:07 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
On Mon 22-05-23 21:57:41, David Howells wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
> meaning: it is only set when a page reference has been acquired that needs
> to be released by bio_release_pages().
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Jan Kara <jack@suse.cz>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: linux-block@vger.kernel.org
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
>
> Notes:
> ver #8)
> - Split out from another patch [hch].
> - Don't default to BIO_PAGE_REFFED [hch].
>
> ver #5)
> - Split from patch that uses iov_iter_extract_pages().
>
> block/bio.c | 2 +-
> block/blk-map.c | 1 +
> fs/direct-io.c | 2 ++
> fs/iomap/direct-io.c | 1 -
> include/linux/bio.h | 2 +-
> include/linux/blk_types.h | 2 +-
> 6 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 043944fd46eb..8516adeaea26 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1191,7 +1191,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
> bio->bi_io_vec = (struct bio_vec *)iter->bvec;
> bio->bi_iter.bi_bvec_done = iter->iov_offset;
> bio->bi_iter.bi_size = size;
> - bio_set_flag(bio, BIO_NO_PAGE_REF);
> bio_set_flag(bio, BIO_CLONED);
> }
>
> @@ -1336,6 +1335,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> return 0;
> }
>
> + bio_set_flag(bio, BIO_PAGE_REFFED);
> do {
> ret = __bio_iov_iter_get_pages(bio, iter);
> } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 04c55f1c492e..33d9f6e89ba6 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> if (blk_queue_pci_p2pdma(rq->q))
> extraction_flags |= ITER_ALLOW_P2PDMA;
>
> + bio_set_flag(bio, BIO_PAGE_REFFED);
> while (iov_iter_count(iter)) {
> struct page **pages, *stack_pages[UIO_FASTIOV];
> ssize_t bytes;
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 0b380bb8a81e..ad20f3428bab 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -402,6 +402,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
> bio->bi_end_io = dio_bio_end_aio;
> else
> bio->bi_end_io = dio_bio_end_io;
> + /* for now require references for all pages */
> + bio_set_flag(bio, BIO_PAGE_REFFED);
> sdio->bio = bio;
> sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
> }
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 66a9f10e3207..08873f0627dd 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -203,7 +203,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> bio->bi_private = dio;
> bio->bi_end_io = iomap_dio_bio_end_io;
>
> - bio_set_flag(bio, BIO_NO_PAGE_REF);
> __bio_add_page(bio, page, len, 0);
> iomap_dio_submit_bio(iter, dio, bio, pos);
> }
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 7f53be035cf0..0922729acd26 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -488,7 +488,7 @@ void zero_fill_bio(struct bio *bio);
>
> static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
> {
> - if (!bio_flagged(bio, BIO_NO_PAGE_REF))
> + if (bio_flagged(bio, BIO_PAGE_REFFED))
> __bio_release_pages(bio, mark_dirty);
> }
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 740afe80f297..dfd2c2cb909d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -323,7 +323,7 @@ struct bio {
> * bio flags
> */
> enum {
> - BIO_NO_PAGE_REF, /* don't put release vec pages */
> + BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
> BIO_CLONED, /* doesn't own data */
> BIO_BOUNCED, /* bio is a bounce bio */
> BIO_QUIET, /* Make BIO Quiet */
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [PATCH v21 4/6] block: Add BIO_PAGE_PINNED and associated infrastructure
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (2 preceding siblings ...)
2023-05-22 20:57 ` [PATCH v21 3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:08 ` Jan Kara
2023-05-22 20:57 ` [PATCH v21 5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
` (4 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---
Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.
ver #9)
- Only consider pinning in bio_set_cleanup_mode(). Ref'ing pages in
struct bio is going away.
- page_put_unpin() is removed; call unpin_user_page() and put_page()
directly.
- Use bio_release_page() in __bio_release_pages().
- BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
when testing both of them.
ver #8)
- Move the infrastructure to clean up pinned pages to this patch [hch].
- Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
probably be removed at some point. FOLL_PIN can then be renumbered
first.
block/bio.c | 6 +++---
block/blk.h | 12 ++++++++++++
include/linux/bio.h | 3 ++-
include/linux/blk_types.h | 1 +
4 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 8516adeaea26..17bd01ecde36 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1169,7 +1169,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
bio_for_each_segment_all(bvec, bio, iter_all) {
if (mark_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
- put_page(bvec->bv_page);
+ bio_release_page(bio, bvec->bv_page);
}
}
EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1489,8 +1489,8 @@ void bio_set_pages_dirty(struct bio *bio)
* the BIO and re-dirty the pages in process context.
*
* It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on. It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on. It will unpin each page and will run one bio_put() against the
+ * BIO.
*/
static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 45547bcf1119..e1ded2ccb3ca 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -420,6 +420,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
unsigned int max_sectors, bool *same_page);
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+ if (bio_flagged(bio, BIO_PAGE_PINNED))
+ unpin_user_page(page);
+ else if (bio_flagged(bio, BIO_PAGE_REFFED))
+ put_page(page);
+}
+
struct request_queue *blk_alloc_queue(int node_id);
int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0922729acd26..8588bcfbc6ef 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -488,7 +488,8 @@ void zero_fill_bio(struct bio *bio);
static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED))
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dfd2c2cb909d..8ef209e3aa96 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,6 +323,7 @@ struct bio {
* bio flags
*/
enum {
+ BIO_PAGE_PINNED, /* Unpin pages in bio_release_pages() */
BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
BIO_CLONED, /* doesn't own data */
BIO_BOUNCED, /* bio is a bounce bio */
* Re: [PATCH v21 4/6] block: Add BIO_PAGE_PINNED and associated infrastructure
2023-05-22 20:57 ` [PATCH v21 4/6] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
@ 2023-05-23 8:08 ` Jan Kara
0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:08 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
On Mon 22-05-23 21:57:42, David Howells wrote:
> Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
> (FOLL_PIN) and that the pin will need removing.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Jan Kara <jack@suse.cz>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: linux-block@vger.kernel.org
Looks good to me. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
>
> Notes:
> ver #10)
> - Drop bio_set_cleanup_mode(), open coding it instead.
>
> ver #9)
> - Only consider pinning in bio_set_cleanup_mode(). Ref'ing pages in
> struct bio is going away.
> - page_put_unpin() is removed; call unpin_user_page() and put_page()
> directly.
> - Use bio_release_page() in __bio_release_pages().
> - BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
> when testing both of them.
>
> ver #8)
> - Move the infrastructure to clean up pinned pages to this patch [hch].
> - Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
> probably be removed at some point. FOLL_PIN can then be renumbered
> first.
>
> block/bio.c | 6 +++---
> block/blk.h | 12 ++++++++++++
> include/linux/bio.h | 3 ++-
> include/linux/blk_types.h | 1 +
> 4 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 8516adeaea26..17bd01ecde36 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1169,7 +1169,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
> bio_for_each_segment_all(bvec, bio, iter_all) {
> if (mark_dirty && !PageCompound(bvec->bv_page))
> set_page_dirty_lock(bvec->bv_page);
> - put_page(bvec->bv_page);
> + bio_release_page(bio, bvec->bv_page);
> }
> }
> EXPORT_SYMBOL_GPL(__bio_release_pages);
> @@ -1489,8 +1489,8 @@ void bio_set_pages_dirty(struct bio *bio)
> * the BIO and re-dirty the pages in process context.
> *
> * It is expected that bio_check_pages_dirty() will wholly own the BIO from
> - * here on. It will run one put_page() against each page and will run one
> - * bio_put() against the BIO.
> + * here on. It will unpin each page and will run one bio_put() against the
> + * BIO.
> */
>
> static void bio_dirty_fn(struct work_struct *work);
> diff --git a/block/blk.h b/block/blk.h
> index 45547bcf1119..e1ded2ccb3ca 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -420,6 +420,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> struct page *page, unsigned int len, unsigned int offset,
> unsigned int max_sectors, bool *same_page);
>
> +/*
> + * Clean up a page appropriately, where the page may be pinned, may have a
> + * ref taken on it or neither.
> + */
> +static inline void bio_release_page(struct bio *bio, struct page *page)
> +{
> + if (bio_flagged(bio, BIO_PAGE_PINNED))
> + unpin_user_page(page);
> + else if (bio_flagged(bio, BIO_PAGE_REFFED))
> + put_page(page);
> +}
> +
> struct request_queue *blk_alloc_queue(int node_id);
>
> int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 0922729acd26..8588bcfbc6ef 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -488,7 +488,8 @@ void zero_fill_bio(struct bio *bio);
>
> static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
> {
> - if (bio_flagged(bio, BIO_PAGE_REFFED))
> + if (bio_flagged(bio, BIO_PAGE_REFFED) ||
> + bio_flagged(bio, BIO_PAGE_PINNED))
> __bio_release_pages(bio, mark_dirty);
> }
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index dfd2c2cb909d..8ef209e3aa96 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -323,6 +323,7 @@ struct bio {
> * bio flags
> */
> enum {
> + BIO_PAGE_PINNED, /* Unpin pages in bio_release_pages() */
> BIO_PAGE_REFFED, /* put pages in bio_release_pages() */
> BIO_CLONED, /* doesn't own data */
> BIO_BOUNCED, /* bio is a bounce bio */
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v21 5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (3 preceding siblings ...)
2023-05-22 20:57 ` [PATCH v21 4/6] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:15 ` Jan Kara
2023-05-22 20:57 ` [PATCH v21 6/6] block: convert bio_map_user_iov " David Howells
` (3 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---
Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.
ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.
ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
block/bio.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 17bd01ecde36..798cc4cf3bd2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1205,7 +1205,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
}
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}
@@ -1219,7 +1219,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
queue_max_zone_append_sectors(q), &same_page) != len)
return -EINVAL;
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
return 0;
}
@@ -1230,10 +1230,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
* @bio: bio to add pages to
* @iter: iov iterator describing the region to be mapped
*
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array. The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
*/
static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
{
@@ -1265,9 +1265,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages(iter, pages,
- UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset, extraction_flags);
+ size = iov_iter_extract_pages(iter, &pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, extraction_flags, &offset);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
@@ -1300,7 +1300,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
iov_iter_revert(iter, left);
out:
while (i < nr_pages)
- put_page(pages[i++]);
+ bio_release_page(bio, pages[i++]);
return ret;
}
@@ -1335,7 +1335,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
return 0;
}
- bio_set_flag(bio, BIO_PAGE_REFFED);
+ if (iov_iter_extract_will_pin(iter))
+ bio_set_flag(bio, BIO_PAGE_PINNED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
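The core decision in this patch (pick the cleanup mode from what the iterator actually did) can be modelled in user space roughly as follows. This is an illustrative sketch with hypothetical types, not kernel code; in the kernel the check is iov_iter_extract_will_pin() feeding bio_set_flag(bio, BIO_PAGE_PINNED):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the pinned/reffed/neither cleanup decision. */
enum cleanup {
	CLEANUP_NONE,	/* kernel-backed pages: whoever built the I/O holds them */
	CLEANUP_PUT,	/* legacy BIO_PAGE_REFFED path: put_page() on completion */
	CLEANUP_UNPIN,	/* BIO_PAGE_PINNED: unpin_user_page() on completion */
};

struct iter_model {
	bool user_backed;	/* pages come from user page tables */
};

/* Mirrors what iov_iter_extract_will_pin() reports: only user-backed
 * iterators hand out pinned pages; kernel-backed iterators leave the
 * pages untouched and rely on the I/O submitter to keep them alive. */
static enum cleanup cleanup_mode(const struct iter_model *it)
{
	return it->user_backed ? CLEANUP_UNPIN : CLEANUP_NONE;
}
```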
* Re: [PATCH v21 5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
2023-05-22 20:57 ` [PATCH v21 5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
@ 2023-05-23 8:15 ` Jan Kara
0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:15 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
On Mon 22-05-23 21:57:43, David Howells wrote:
> This will pin pages or leave them unaltered rather than getting a ref on
> them as appropriate to the iterator.
>
> The pages need to be pinned for DIO rather than having refs taken on them to
> prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
> result of the I/O could otherwise end up being affected by/visible to the
> child process).
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Jan Kara <jack@suse.cz>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: linux-block@vger.kernel.org
> ---
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> Notes:
> ver #10)
> - Drop bio_set_cleanup_mode(), open coding it instead.
>
> ver #8)
> - Split the patch up a bit [hch].
> - We should only be using pinned/non-pinned pages and not ref'd pages,
> so adjust the comments appropriately.
>
> ver #7)
> - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
>
> ver #5)
> - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
> BIO_* flags and got rid of bi_cleanup_mode.
> - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
>
> block/bio.c | 23 ++++++++++++-----------
> 1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 17bd01ecde36..798cc4cf3bd2 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1205,7 +1205,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
> }
>
> if (same_page)
> - put_page(page);
> + bio_release_page(bio, page);
> return 0;
> }
>
> @@ -1219,7 +1219,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
> queue_max_zone_append_sectors(q), &same_page) != len)
> return -EINVAL;
> if (same_page)
> - put_page(page);
> + bio_release_page(bio, page);
> return 0;
> }
>
> @@ -1230,10 +1230,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
> * @bio: bio to add pages to
> * @iter: iov iterator describing the region to be mapped
> *
> - * Pins pages from *iter and appends them to @bio's bvec array. The
> - * pages will have to be released using put_page() when done.
> - * For multi-segment *iter, this function only adds pages from the
> - * next non-empty segment of the iov iterator.
> + * Extracts pages from *iter and appends them to @bio's bvec array. The pages
> + * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
> + * For a multi-segment *iter, this function only adds pages from the next
> + * non-empty segment of the iov iterator.
> */
> static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> {
> @@ -1265,9 +1265,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> - size = iov_iter_get_pages(iter, pages,
> - UINT_MAX - bio->bi_iter.bi_size,
> - nr_pages, &offset, extraction_flags);
> + size = iov_iter_extract_pages(iter, &pages,
> + UINT_MAX - bio->bi_iter.bi_size,
> + nr_pages, extraction_flags, &offset);
> if (unlikely(size <= 0))
> return size ? size : -EFAULT;
>
> @@ -1300,7 +1300,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> iov_iter_revert(iter, left);
> out:
> while (i < nr_pages)
> - put_page(pages[i++]);
> + bio_release_page(bio, pages[i++]);
>
> return ret;
> }
> @@ -1335,7 +1335,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> return 0;
> }
>
> - bio_set_flag(bio, BIO_PAGE_REFFED);
> + if (iov_iter_extract_will_pin(iter))
> + bio_set_flag(bio, BIO_PAGE_PINNED);
> do {
> ret = __bio_iov_iter_get_pages(bio, iter);
> } while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* [PATCH v21 6/6] block: convert bio_map_user_iov to use iov_iter_extract_pages
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (4 preceding siblings ...)
2023-05-22 20:57 ` [PATCH v21 5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
@ 2023-05-22 20:57 ` David Howells
2023-05-23 8:14 ` Jan Kara
2023-05-23 6:39 ` [PATCH v21 0/6] block: Use page pinning Christoph Hellwig
` (2 subsequent siblings)
8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-22 20:57 UTC (permalink / raw)
To: Jens Axboe, Al Viro, Christoph Hellwig
Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.
The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---
Notes:
ver #10)
- Drop bio_set_cleanup_mode(), open coding it instead.
ver #8)
- Split the patch up a bit [hch].
- We should only be using pinned/non-pinned pages and not ref'd pages,
so adjust the comments appropriately.
ver #7)
- Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
ver #5)
- Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
BIO_* flags and got rid of bi_cleanup_mode.
- Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
block/blk-map.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 33d9f6e89ba6..3551c3ff17cf 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (blk_queue_pci_p2pdma(rq->q))
extraction_flags |= ITER_ALLOW_P2PDMA;
+ if (iov_iter_extract_will_pin(iter))
+ bio_set_flag(bio, BIO_PAGE_PINNED);
- bio_set_flag(bio, BIO_PAGE_REFFED);
while (iov_iter_count(iter)) {
- struct page **pages, *stack_pages[UIO_FASTIOV];
+ struct page *stack_pages[UIO_FASTIOV];
+ struct page **pages = stack_pages;
ssize_t bytes;
size_t offs;
int npages;
- if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
- pages = stack_pages;
- bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
- nr_vecs, &offs, extraction_flags);
- } else {
- bytes = iov_iter_get_pages_alloc(iter, &pages,
- LONG_MAX, &offs, extraction_flags);
- }
+ if (nr_vecs > ARRAY_SIZE(stack_pages))
+ pages = NULL;
+
+ bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+ nr_vecs, extraction_flags, &offs);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
@@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
if (!bio_add_hw_page(rq->q, bio, page, n, offs,
max_sectors, &same_page)) {
if (same_page)
- put_page(page);
+ bio_release_page(bio, page);
break;
}
@@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
* release the pages we didn't map into the bio, if any
*/
while (j < npages)
- put_page(pages[j++]);
+ bio_release_page(bio, pages[j++]);
if (pages != stack_pages)
kvfree(pages);
/* couldn't stuff something into bio? */
* Re: [PATCH v21 6/6] block: convert bio_map_user_iov to use iov_iter_extract_pages
2023-05-22 20:57 ` [PATCH v21 6/6] block: convert bio_map_user_iov " David Howells
@ 2023-05-23 8:14 ` Jan Kara
0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2023-05-23 8:14 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Christoph Hellwig,
John Hubbard
On Mon 22-05-23 21:57:44, David Howells wrote:
> This will pin pages or leave them unaltered rather than getting a ref on
> them as appropriate to the iterator.
>
> The pages need to be pinned for DIO rather than having refs taken on them
> to prevent VM copy-on-write from malfunctioning during a concurrent fork()
> (the result of the I/O could otherwise end up being visible to/affected by
> the child process).
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Jan Kara <jack@suse.cz>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Logan Gunthorpe <logang@deltatee.com>
> cc: linux-block@vger.kernel.org
> ---
Looks good. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
>
> Notes:
> ver #10)
> - Drop bio_set_cleanup_mode(), open coding it instead.
>
> ver #8)
> - Split the patch up a bit [hch].
> - We should only be using pinned/non-pinned pages and not ref'd pages,
> so adjust the comments appropriately.
>
> ver #7)
> - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
>
> ver #5)
> - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
> BIO_* flags and got rid of bi_cleanup_mode.
> - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.
>
> block/blk-map.c | 23 +++++++++++------------
> 1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 33d9f6e89ba6..3551c3ff17cf 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>
> if (blk_queue_pci_p2pdma(rq->q))
> extraction_flags |= ITER_ALLOW_P2PDMA;
> + if (iov_iter_extract_will_pin(iter))
> + bio_set_flag(bio, BIO_PAGE_PINNED);
>
> - bio_set_flag(bio, BIO_PAGE_REFFED);
> while (iov_iter_count(iter)) {
> - struct page **pages, *stack_pages[UIO_FASTIOV];
> + struct page *stack_pages[UIO_FASTIOV];
> + struct page **pages = stack_pages;
> ssize_t bytes;
> size_t offs;
> int npages;
>
> - if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
> - pages = stack_pages;
> - bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
> - nr_vecs, &offs, extraction_flags);
> - } else {
> - bytes = iov_iter_get_pages_alloc(iter, &pages,
> - LONG_MAX, &offs, extraction_flags);
> - }
> + if (nr_vecs > ARRAY_SIZE(stack_pages))
> + pages = NULL;
> +
> + bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
> + nr_vecs, extraction_flags, &offs);
> if (unlikely(bytes <= 0)) {
> ret = bytes ? bytes : -EFAULT;
> goto out_unmap;
> @@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> if (!bio_add_hw_page(rq->q, bio, page, n, offs,
> max_sectors, &same_page)) {
> if (same_page)
> - put_page(page);
> + bio_release_page(bio, page);
> break;
> }
>
> @@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> * release the pages we didn't map into the bio, if any
> */
> while (j < npages)
> - put_page(pages[j++]);
> + bio_release_page(bio, pages[j++]);
> if (pages != stack_pages)
> kvfree(pages);
> /* couldn't stuff something into bio? */
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH v21 0/6] block: Use page pinning
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (5 preceding siblings ...)
2023-05-22 20:57 ` [PATCH v21 6/6] block: convert bio_map_user_iov " David Howells
@ 2023-05-23 6:39 ` Christoph Hellwig
2023-05-23 20:16 ` Extending page pinning into fs/direct-io.c David Howells
2023-05-23 21:38 ` [PATCH v21 0/6] block: Use page pinning Jens Axboe
8 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-05-23 6:39 UTC (permalink / raw)
To: David Howells
Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Mon, May 22, 2023 at 09:57:38PM +0100, David Howells wrote:
> Hi Jens, Al, Christoph,
>
> This patchset rolls page-pinning out to the bio struct and the block layer,
> using iov_iter_extract_pages() to get pages and noting with BIO_PAGE_PINNED
> if the data pages attached to a bio are pinned. If the data pages come
> from a non-user-backed iterator, then the pages are left unpinned and
> unref'd, relying on whoever set up the I/O to do the retaining.
I think I already reviewed the patches, so nothing new here. But can
you please also take care of the legacy direct I/O code? I'd really
hate to leave yet another unfinished transition around.
* Extending page pinning into fs/direct-io.c
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (6 preceding siblings ...)
2023-05-23 6:39 ` [PATCH v21 0/6] block: Use page pinning Christoph Hellwig
@ 2023-05-23 20:16 ` David Howells
2023-05-24 5:55 ` Christoph Hellwig
` (2 more replies)
2023-05-23 21:38 ` [PATCH v21 0/6] block: Use page pinning Jens Axboe
8 siblings, 3 replies; 34+ messages in thread
From: David Howells @ 2023-05-23 20:16 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm
Christoph Hellwig <hch@infradead.org> wrote:
> But can you please also take care of the legacy direct I/O code? I'd really
> hate to leave yet another unfinished transition around.
I've been poking at it this afternoon, but it doesn't look like it's going to
be straightforward, unfortunately. The mm folks have been withdrawing access
to the pinning API behind the ramparts of the mm/ dir. Further, the dio code
will (I think), under some circumstances, arbitrarily insert the zero_page
into a list of things that are maybe pinned or maybe unpinned, but I can (I
think) also be given a pinned zero_page from the GUP code if the page tables
point to one and a DIO-write is requested - so just doing if page == zero_page
isn't sufficient.
What I'd like to do is to make the GUP code not take a ref on the zero_page
if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
code always ignore the zero_page.
Alternatively, I can drop the pin immediately if I get given one on the
zero_page - it's not going anywhere, after all.
I also need to be able to take an additional pin on a folio that gets split
across multiple bio submissions to replace the get_page() that's there now.
Alternatively to that, I can decide how much data I'm willing to read/write in
one batch, call something like netfs_extract_user_iter() to decant that
portion of the parameter iterator into a bvec[] and let that look up the
overlapping page multiple times. However, I'm not sure if this would work
well for a couple of reasons: does a single bio have to refer to a contiguous
range of disk blocks? and we might expend time on getting pages we then have
to give up because we hit a hole.
Something that I noticed is that the dio code seems to wangle the page bits on
the target pages for a DIO-read, which seems odd, but I'm not sure I fully
understand the code yet.
David
* Re: Extending page pinning into fs/direct-io.c
2023-05-23 20:16 ` Extending page pinning into fs/direct-io.c David Howells
@ 2023-05-24 5:55 ` Christoph Hellwig
2023-05-24 7:06 ` David Hildenbrand
2023-05-24 8:47 ` David Howells
2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-05-24 5:55 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Tue, May 23, 2023 at 09:16:11PM +0100, David Howells wrote:
> I've been poking at it this afternoon, but it doesn't look like it's going to
> be straightforward, unfortunately. The mm folks have been withdrawing access
> to the pinning API behind the ramparts of the mm/ dir. Further, the dio code
> will (I think), under some circumstances, arbitrarily insert the zero_page
> into a list of things that are maybe pinned or maybe unpinned, but I can (I
> think) also be given a pinned zero_page from the GUP code if the page tables
> point to one and a DIO-write is requested - so just doing if page == zero_page
> isn't sufficient.
Yes. I think the proper workaround is to add a MM helper that just
pins a single page and make it available to direct-io.c. It should not
be exported and clearly marked to not be used in new code.
> What I'd like to do is to make the GUP code not take a ref on the zero_page
> if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> code always ignore the zero_page.
I don't think that'll work, as we can't mix different pin vs get types
in a bio. And that's really a good thing.
> > Something that I noticed is that the dio code seems to wangle the page bits on
> the target pages for a DIO-read, which seems odd, but I'm not sure I fully
> understand the code yet.
I don't understand this sentence.
* Re: Extending page pinning into fs/direct-io.c
2023-05-23 20:16 ` Extending page pinning into fs/direct-io.c David Howells
2023-05-24 5:55 ` Christoph Hellwig
@ 2023-05-24 7:06 ` David Hildenbrand
2023-05-24 8:47 ` David Howells
2 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2023-05-24 7:06 UTC (permalink / raw)
To: David Howells, Christoph Hellwig
Cc: Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton,
Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
linux-kernel, linux-mm
On 23.05.23 22:16, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> But can you please also take care of the legacy direct I/O code? I'd really
>> hate to leave yet another unfinished transition around.
>
> I've been poking at it this afternoon, but it doesn't look like it's going to
> be straightforward, unfortunately. The mm folks have been withdrawing access
> to the pinning API behind the ramparts of the mm/ dir. Further, the dio code
> will (I think), under some circumstances, arbitrarily insert the zero_page
> into a list of things that are maybe pinned or maybe unpinned, but I can (I
> think) also be given a pinned zero_page from the GUP code if the page tables
> point to one and a DIO-write is requested - so just doing if page == zero_page
> isn't sufficient.
>
> What I'd like to do is to make the GUP code not take a ref on the zero_page
> if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> code always ignore the zero_page.
We discussed doing that unconditionally in the context of vfio (below), but vfio
decided to add a workaround suitable for stable.
In case of FOLL_PIN it's simple: if we detect the zeropage, don't mess with the
refcount when pinning and don't mess with the refcount when unpinning (esp.
unpin_user_pages). FOLL_GET is a different story but we don't have to mess with
that.
So there shouldn't be need for a FOLL_DONT_PIN_ZEROPAGE, we could just do it
unconditionally.
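
That rule (leave the shared zeropage's refcount alone on both the pin and the unpin side) can be sketched with a hypothetical user-space model; the types and helpers below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of FOLL_PIN handling that never touches the
 * shared zeropage's refcount, on either side of the pin/unpin pair. */
struct page_model {
	bool is_zero_page;
	long pincount;
};

static void pin_page(struct page_model *p)
{
	if (p->is_zero_page)
		return;		/* zeropage: refcount is never raised */
	p->pincount++;
}

static void unpin_page(struct page_model *p)
{
	if (p->is_zero_page)
		return;		/* matching rule on the unpin side */
	p->pincount--;
}
```

The point is that the skip must be symmetric: if the pin side stops accounting the zeropage, the unpin path (e.g. unpin_user_pages()) has to detect and skip it too, or counts go negative.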
>
> Alternatively, I can drop the pin immediately if I get given one on the
> zero_page - it's not going anywhere, after all.
That's what vfio did in
commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
Author: Alex Williamson <alex.williamson@redhat.com>
Date: Mon Aug 29 21:05:40 2022 -0600
vfio/type1: Unpin zero pages
There's currently a reference count leak on the zero page. We increment
the reference via pin_user_pages_remote(), but the page is later handled
as an invalid/reserved page, therefore it's not accounted against the
user and not unpinned by our put_pfn().
Introducing special zero page handling in put_pfn() would resolve the
leak, but without accounting of the zero page, a single user could
still create enough mappings to generate a reference count overflow.
The zero page is always resident, so for our purposes there's no reason
to keep it pinned. Therefore, add a loop to walk pages returned from
pin_user_pages_remote() and unpin any zero pages.
For vfio that handling is no longer required, because FOLL_LONGTERM will never pin
the shared zeropage.
--
Thanks,
David / dhildenb
* Re: Extending page pinning into fs/direct-io.c
2023-05-23 20:16 ` Extending page pinning into fs/direct-io.c David Howells
2023-05-24 5:55 ` Christoph Hellwig
2023-05-24 7:06 ` David Hildenbrand
@ 2023-05-24 8:47 ` David Howells
2023-05-25 9:51 ` Christoph Hellwig
` (2 more replies)
2 siblings, 3 replies; 34+ messages in thread
From: David Howells @ 2023-05-24 8:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm
Christoph Hellwig <hch@infradead.org> wrote:
> > What I'd like to do is to make the GUP code not take a ref on the zero_page
> > if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> > code always ignore the zero_page.
>
> I don't think that'll work, as we can't mix different pin vs get types
> in a bio. And that's really a good thing.
True - but I was thinking of just treating the zero_page specially and never
hold a pin or a ref on it. It can be checked by address, e.g.:
static inline void bio_release_page(struct bio *bio, struct page *page)
{
if (page == ZERO_PAGE(0))
return;
if (bio_flagged(bio, BIO_PAGE_PINNED))
unpin_user_page(page);
else if (bio_flagged(bio, BIO_PAGE_REFFED))
put_page(page);
}
I'm slightly concerned about the possibility of overflowing the refcount. The
problem is that it only takes about 2 million pins to do that (because the
zero_page isn't a large folio) - which is within reach of userspace. Create
an 8GiB anon mmap and do a bunch of async DIO writes from it. You won't hit
ENOMEM because it will stick ~2 million pointers to zero_page into the page
tables.
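
A rough model of that overflow arithmetic, assuming the usual 4KiB page size, a 32-bit page refcount, and a GUP_PIN_COUNTING_BIAS of 1024 (the bias each FOLL_PIN adds in current kernels):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative arithmetic only; the kernel constant lives in
 * include/linux/mm.h. Each FOLL_PIN adds this much to _refcount. */
#define GUP_PIN_COUNTING_BIAS 1024ULL

static uint64_t pages_in_mapping(uint64_t bytes, uint64_t page_size)
{
	return bytes / page_size;
}

static uint64_t pins_to_refcount_overflow(void)
{
	/* a signed 32-bit refcount goes negative at 2^31 */
	return (1ULL << 31) / GUP_PIN_COUNTING_BIAS;
}
```

Both numbers come out to 2097152, which is why an 8GiB anon mapping full of zero-page PTEs is enough to reach the overflow point.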
> > Something that I noticed is that the dio code seems to wangle to page bits on
> > the target pages for a DIO-read, which seems odd, but I'm not sure I fully
> > understand the code yet.
>
> I don't understand this sentence.
I was looking at this:
static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
{
...
if (dio->is_async && dio_op == REQ_OP_READ && dio->should_dirty)
bio_set_pages_dirty(bio);
...
}
but looking again, the lock is taken briefly and the dirty bit is set - which
is reasonable. However, should we be doing it before starting the I/O?
David
* Re: Extending page pinning into fs/direct-io.c
2023-05-24 8:47 ` David Howells
@ 2023-05-25 9:51 ` Christoph Hellwig
2023-05-25 16:31 ` Linus Torvalds
2023-05-25 17:00 ` David Howells
2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-05-25 9:51 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Wed, May 24, 2023 at 09:47:10AM +0100, David Howells wrote:
> True - but I was thinking of just treating the zero_page specially and never
> hold a pin or a ref on it. It can be checked by address, e.g.:
>
> static inline void bio_release_page(struct bio *bio, struct page *page)
> {
> if (page == ZERO_PAGE(0))
> return;
> if (bio_flagged(bio, BIO_PAGE_PINNED))
> unpin_user_page(page);
> else if (bio_flagged(bio, BIO_PAGE_REFFED))
> put_page(page);
> }
That does sound good as well to me.
> I was looking at this:
>
> static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
> {
> ...
> if (dio->is_async && dio_op == REQ_OP_READ && dio->should_dirty)
> bio_set_pages_dirty(bio);
> ...
> }
>
> but looking again, the lock is taken briefly and the dirty bit is set - which
> is reasonable. However, should we be doing it before starting the I/O?
It is done before starting the I/O - the submit_bio is just below this.
* Re: Extending page pinning into fs/direct-io.c
2023-05-24 8:47 ` David Howells
2023-05-25 9:51 ` Christoph Hellwig
@ 2023-05-25 16:31 ` Linus Torvalds
2023-05-25 16:45 ` David Hildenbrand
2023-05-25 17:07 ` David Howells
2023-05-25 17:00 ` David Howells
2 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-05-25 16:31 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, linux-fsdevel, linux-block,
linux-kernel, linux-mm
On Wed, May 24, 2023 at 1:47 AM David Howells <dhowells@redhat.com> wrote:
>
> True - but I was thinking of just treating the zero_page specially and never
> hold a pin or a ref on it. It can be checked by address, e.g.:
>
> static inline void bio_release_page(struct bio *bio, struct page *page)
> {
> if (page == ZERO_PAGE(0))
> return;
That won't actually work.
We do have cases that try to use the page coloring that we support.
Admittedly it seems to be only rdma that does it directly with
something like this:
vmf->page = ZERO_PAGE(vmf->address);
but you can get arbitrary zero pages by pinning or GUPing them from
user space mappings.
Now, the only architectures that *use* multiple zero pages are - I
think - MIPS (including Loongarch) and s390.
So it's rare, but it does happen.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 16:31 ` Linus Torvalds
@ 2023-05-25 16:45 ` David Hildenbrand
2023-05-25 17:04 ` Linus Torvalds
2023-05-25 17:15 ` David Howells
2023-05-25 17:07 ` David Howells
1 sibling, 2 replies; 34+ messages in thread
From: David Hildenbrand @ 2023-05-25 16:45 UTC (permalink / raw)
To: Linus Torvalds, David Howells
Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
Christian Brauner, linux-fsdevel, linux-block, linux-kernel,
linux-mm
On 25.05.23 18:31, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 1:47 AM David Howells <dhowells@redhat.com> wrote:
>>
>> True - but I was thinking of just treating the zero_page specially and never
>> hold a pin or a ref on it. It can be checked by address, e.g.:
>>
>> static inline void bio_release_page(struct bio *bio, struct page *page)
>> {
>> if (page == ZERO_PAGE(0))
>> return;
>
> That won't actually work.
>
> We do have cases that try to use the page coloring that we support.
>
> Admittedly it seems to be only rdma that does it directly with
> something like this:
>
> vmf->page = ZERO_PAGE(vmf->address);
>
> but you can get arbitrary zero pages by pinning or GUPing them from
> user space mappings.
>
> Now, the only architectures that *use* multiple zero pages are - I
> think - MIPS (including Loongarch) and s390.
>
> So it's rare, but it does happen.
I think the correct way to test for a zero page is
is_zero_pfn(page_to_pfn(page)).
Using my_zero_pfn(vmf->address) in do_anonymous_page() these can easily
end up in any process.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 16:45 ` David Hildenbrand
@ 2023-05-25 17:04 ` Linus Torvalds
2023-05-25 17:15 ` David Howells
1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-05-25 17:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: David Howells, Christoph Hellwig, Jens Axboe, Al Viro,
Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Thu, May 25, 2023 at 9:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> I think the correct way to test for a zero page is
> is_zero_pfn(page_to_pfn(page).
Yeah. Except it's really ugly and strange, and we should probably add
a helper for that pattern.
The reason it has that odd "look at pfn" is just because I think the
first users were in the page table code, which had the pfn already,
and the test is basically based on the zero_page_mask thing that the
affected architectures have.
So I suspect we should add that
is_zero_pfn(page_to_pfn(page))
as a helper inline function rather than write it out even more times
(that "is this 'struct page' a zero page" pattern already exists in
/proc and a few other places).
is_longterm_pinnable_page() already has it, so adding it as a helper
there in <linux/mm.h> is probably a good idea.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 16:45 ` David Hildenbrand
2023-05-25 17:04 ` Linus Torvalds
@ 2023-05-25 17:15 ` David Howells
2023-05-25 17:25 ` Linus Torvalds
1 sibling, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-25 17:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, David Hildenbrand, Christoph Hellwig, Jens Axboe,
Al Viro, Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So I suspect we should add that
>
> is_zero_pfn(page_to_pfn(page))
>
> as a helper inline function rather than write it out even more times
> (that "is this 'struct page' a zero page" pattern already exists in
> /proc and a few other places).
>
> is_longterm_pinnable_page() already has it, so adding it as a helper
> there in <linux/mm.h> is probably a good idea.
I just added:
static inline bool IS_ZERO_PAGE(const struct page *page)
{
return is_zero_pfn(page_to_pfn(page));
}
static inline bool IS_ZERO_FOLIO(const struct folio *folio)
{
return is_zero_pfn(page_to_pfn((const struct page *)folio));
}
to include/linux/pgtable.h. It doesn't seem I can add it to mm.h as an inline
function.
David
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 17:15 ` David Howells
@ 2023-05-25 17:25 ` Linus Torvalds
0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-05-25 17:25 UTC (permalink / raw)
To: David Howells
Cc: David Hildenbrand, Christoph Hellwig, Jens Axboe, Al Viro,
Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Thu, May 25, 2023 at 10:15 AM David Howells <dhowells@redhat.com> wrote:
>
> It doesn't seem I can add it to mm.h as an inline function.
What? We already have that pattern inside is_longterm_pinnable_page(),
so that's really strange.
But regardless, please don't duplicate that odd conditional for no
reason, and don't scream.
So regardless of where it is, make that "is_zero_folio()" just do
"is_zero_page(&folio->page)" rather than repeat the question.
I also wonder whether we shouldn't just use the "transparent union"
argument thing more aggressively. Something like
typedef union {
struct page *page;
struct folio *folio;
} page_or_folio_t __attribute__ ((__transparent_union__));
and then you should be able to do something like this:
static inline bool is_zero_page(const page_or_folio_t arg)
{
return is_zero_pfn(page_to_pfn(arg.page));
}
and we don't have to keep generating the two versions over and over
and over again.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 16:31 ` Linus Torvalds
2023-05-25 16:45 ` David Hildenbrand
@ 2023-05-25 17:07 ` David Howells
2023-05-25 17:17 ` Linus Torvalds
1 sibling, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-25 17:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: dhowells, Linus Torvalds, Christoph Hellwig, Jens Axboe, Al Viro,
Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
David Hildenbrand <david@redhat.com> wrote:
> I think the correct way to test for a zero page is
> is_zero_pfn(page_to_pfn(page)).
>
> Using my_zero_pfn(vmf->address) in do_anonymous_page() these can easily end up
> in any process.
Should everywhere that is using ZERO_PAGE(0) actually be using my_zero_pfn()?
ZERO_PAGE() could do with a kdoc comment saying how to use it.
David
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 17:07 ` David Howells
@ 2023-05-25 17:17 ` Linus Torvalds
0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-05-25 17:17 UTC (permalink / raw)
To: David Howells
Cc: David Hildenbrand, Christoph Hellwig, Jens Axboe, Al Viro,
Matthew Wilcox, Jan Kara, Jeff Layton, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
On Thu, May 25, 2023 at 10:07 AM David Howells <dhowells@redhat.com> wrote:
>
> Should everywhere that is using ZERO_PAGE(0) actually be using my_zero_pfn()?
No, that would just make code uglier for no reason, because then you
have to turn that pfn into a virtual address.
So if what you *want* is a pfn to begin with, then use, use my_zero_pfn().
But if what you want is just the virtual address, use ZERO_PAGE().
And if you are going to map it at some address, give it the address
you're going to use, otherwise just do zero for "whatever".
The only thing you can't use ZERO_PAGE(0) for is literally that "is
this a zero page" address comparison, because ZERO_PAGE(0) is just
_one_ address.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-24 8:47 ` David Howells
2023-05-25 9:51 ` Christoph Hellwig
2023-05-25 16:31 ` Linus Torvalds
@ 2023-05-25 17:00 ` David Howells
2023-05-25 17:13 ` Linus Torvalds
2 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-05-25 17:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: dhowells, Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox,
Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
linux-block, linux-kernel, linux-mm
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> We do have cases that try to use the page coloring that we support.
What do we gain from it? Presumably since nothing is supposed to write to
that page, it can be shared in all the caches.
David
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: Extending page pinning into fs/direct-io.c
2023-05-25 17:00 ` David Howells
@ 2023-05-25 17:13 ` Linus Torvalds
0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-05-25 17:13 UTC (permalink / raw)
To: David Howells
Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
Hillf Danton, Christian Brauner, linux-fsdevel, linux-block,
linux-kernel, linux-mm
On Thu, May 25, 2023 at 10:01 AM David Howells <dhowells@redhat.com> wrote:
>
> What do we gain from it? Presumably since nothing is supposed to write to
> that page, it can be shared in all the caches.
I don't remember the details, but they went something like "broken
purely virtually indexed cache avoids physical aliases by cacheline
exclusion at fill time".
Which then meant that if you walk a zero mapping, you'll invalidate
the caches of the previous page when you walk the next one. Causing
horrendously bad performance.
Unless it's colored.
Something like that. I probably got all the details wrong.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v21 0/6] block: Use page pinning
2023-05-22 20:57 [PATCH v21 0/6] block: Use page pinning David Howells
` (7 preceding siblings ...)
2023-05-23 20:16 ` Extending page pinning into fs/direct-io.c David Howells
@ 2023-05-23 21:38 ` Jens Axboe
2023-05-24 5:52 ` Christoph Hellwig
2023-05-24 7:35 ` David Howells
8 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2023-05-23 21:38 UTC (permalink / raw)
To: Al Viro, Christoph Hellwig, David Howells
Cc: Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
linux-fsdevel, linux-block, linux-kernel, linux-mm,
Jason Gunthorpe
On Mon, 22 May 2023 21:57:38 +0100, David Howells wrote:
> This patchset rolls page-pinning out to the bio struct and the block layer,
> using iov_iter_extract_pages() to get pages and noting with BIO_PAGE_PINNED
> if the data pages attached to a bio are pinned. If the data pages come
> from a non-user-backed iterator, then the pages are left unpinned and
> unref'd, relying on whoever set up the I/O to do the retaining.
>
> This requires the splice-read patchset to have been applied first,
> otherwise reversion of the ITER_PAGE iterator can race with truncate and
> return pages to the allocator whilst they're still undergoing DMA[2].
>
> [...]
Applied, thanks!
[1/6] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
commit: 9e73bb36b189ec73c7062ec974e0ff287c1aa152
[2/6] block: Fix bio_flagged() so that gcc can better optimise it
commit: b9cc607a7f722c374540b2a7c973382592196549
[3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
commit: 100ae68dac60a0688082dcaf3e436606ec0fd51f
[4/6] block: Add BIO_PAGE_PINNED and associated infrastructure
commit: 84d9fe8b7ea6a53fd93506583ff33a408f95ac60
[5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
commit: b7c96963925fe08d4ef175b7d438c0017155807c
[6/6] block: convert bio_map_user_iov to use iov_iter_extract_pages
commit: 36b61bb07963b13de4cc03a945aa25b9ffc7d003
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v21 0/6] block: Use page pinning
2023-05-23 21:38 ` [PATCH v21 0/6] block: Use page pinning Jens Axboe
@ 2023-05-24 5:52 ` Christoph Hellwig
2023-05-24 14:43 ` Jens Axboe
2023-05-24 7:35 ` David Howells
1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2023-05-24 5:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Al Viro, Christoph Hellwig, David Howells, Matthew Wilcox,
Jan Kara, Jeff Layton, David Hildenbrand, Logan Gunthorpe,
Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
linux-block, linux-kernel, linux-mm, Jason Gunthorpe
On Tue, May 23, 2023 at 03:38:31PM -0600, Jens Axboe wrote:
> Applied, thanks!
This ended up on the for-6.5/block branch, but I think it needs to be
on the splice one, as that is pre-requisite unless I'm missing
something.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v21 0/6] block: Use page pinning
2023-05-24 5:52 ` Christoph Hellwig
@ 2023-05-24 14:43 ` Jens Axboe
0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2023-05-24 14:43 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Al Viro, David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
David Hildenbrand, Logan Gunthorpe, Hillf Danton,
Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
linux-kernel, linux-mm, Jason Gunthorpe
On 5/23/23 11:52 PM, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:38:31PM -0600, Jens Axboe wrote:
>> Applied, thanks!
>
> This ended up on the for-6.5/block branch, but I think it needs to be
> on the splice one, as that is pre-requisite unless I'm missing
> something.
Oops yes, that's my bad. I've reshuffled things now so that they should
make more sense.
--
Jens Axboe
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v21 0/6] block: Use page pinning
2023-05-23 21:38 ` [PATCH v21 0/6] block: Use page pinning Jens Axboe
2023-05-24 5:52 ` Christoph Hellwig
@ 2023-05-24 7:35 ` David Howells
1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2023-05-24 7:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
Jeff Layton, David Hildenbrand, Logan Gunthorpe, Hillf Danton,
Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
linux-kernel, linux-mm, Jason Gunthorpe
Christoph Hellwig <hch@infradead.org> wrote:
> > Applied, thanks!
>
> This ended up on the for-6.5/block branch, but I think it needs to be
> on the splice one, as that is pre-requisite unless I'm missing
> something.
Indeed. As I noted in the cover note:
This requires the splice-read patchset to have been applied first,
otherwise reversion of the ITER_PAGE iterator can race with truncate and
return pages to the allocator whilst they're still undergoing DMA[2].
David
^ permalink raw reply [flat|nested] 34+ messages in thread