* [PATCHv3 0/7] dma mapping optimisations
@ 2022-08-05 16:24 Keith Busch
  2022-08-05 16:24 ` [PATCHv3 1/7] blk-mq: add ops to dma map bvec Keith Busch
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Changes since v2:

  Fixed incorrect io_uring io_fixed_file index validity checks: this
  should have been validating the file_ptr (Ammar)

  Various micro-optimizations: check the dma iov_iter type earlier, skip
  iov_iter_advance on async IO (Jens).

  NVMe driver cleanups splitting the fast and slow paths.

  NVMe driver prp list setup fixes when using the slow path.

Summary:

A user address undergoes various representations for a typical read or
write command, and each one consumes memory and CPU cycles. When the
backing storage is NVMe, the sequence looks something like the following:

  __user void *
  struct iov_iter
  struct pages[]
  struct bio_vec[]
  struct scatterlist[]
  __le64[]

Applications will often reuse the same buffer for many IOs, so these
potentially costly per-IO transformations into the exact same hardware
descriptors can be skipped.

The io_uring interface already provides a way for users to register
buffers and get to a 'struct bio_vec[]'. That still leaves the
scatterlist needed for the repeated dma_map_sg() calls, followed by the
transformation into NVMe's PRP list format.

This series takes the registered buffers a step further. A block driver
can implement a new .dma_map() callback to reach the hardware's DMA
mapped address format, and return a cookie so a user can reference it
later for any given IO. When used, the block stack can skip significant
amounts of code, improving CPU utilization and IOPS.

The implementation is currently limited to mapping a registered buffer
to a single io_uring fixed file.
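
For illustration, the intended flow from an application looks roughly
like the sketch below: liburing handles the ring, fixed file and fixed
buffer setup, and the raw io_uring_register(2) syscall drives the new
map opcode added in patch 6. The struct and opcode names are the ones
this series defines; error handling is omitted.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <liburing.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		size_t len = 65536;
		void *buf = aligned_alloc(4096, len);	/* page aligned */
		struct iovec iov = { .iov_base = buf, .iov_len = len };
		int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

		io_uring_queue_init(8, &ring, 0);
		io_uring_register_files(&ring, &fd, 1);		/* fixed file slot 0 */
		io_uring_register_buffers(&ring, &iov, 1);	/* fixed buffer 0 */

		/* needs CAP_SYS_ADMIN; opcode added as 28 in this series */
		struct io_uring_map_buffers map = {
			.fd		= 0,	/* index into the fixed *file* table */
			.buf_start	= 0,
			.buf_end	= 1,
		};
		syscall(__NR_io_uring_register, ring.ring_fd,
			IORING_REGISTER_MAP_BUFFERS, &map, 1);

		/* fixed-buffer IO to fixed file 0 now reuses the dma mapping */
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read_fixed(sqe, 0, buf, len, 0, 0);
		sqe->flags |= IOSQE_FIXED_FILE;
		io_uring_submit(&ring);
		return 0;
	}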

Keith Busch (7):
  blk-mq: add ops to dma map bvec
  file: add ops to dma map bvec
  iov_iter: introduce type for preregistered dma tags
  block: add dma tag bio type
  io_uring: introduce file slot release helper
  io_uring: add support for dma pre-mapping
  nvme-pci: implement dma_map support

 block/bdev.c                   |  20 +++
 block/bio.c                    |  24 ++-
 block/blk-merge.c              |  19 ++
 block/fops.c                   |  24 ++-
 drivers/nvme/host/pci.c        | 314 +++++++++++++++++++++++++++++++--
 fs/file.c                      |  15 ++
 include/linux/bio.h            |  22 ++-
 include/linux/blk-mq.h         |  24 +++
 include/linux/blk_types.h      |   6 +-
 include/linux/blkdev.h         |  16 ++
 include/linux/fs.h             |  20 +++
 include/linux/io_uring_types.h |   2 +
 include/linux/uio.h            |   9 +
 include/uapi/linux/io_uring.h  |  12 ++
 io_uring/filetable.c           |  34 ++--
 io_uring/filetable.h           |  10 +-
 io_uring/io_uring.c            | 139 +++++++++++++++
 io_uring/net.c                 |   2 +-
 io_uring/rsrc.c                |  27 +--
 io_uring/rsrc.h                |  10 +-
 io_uring/rw.c                  |   2 +-
 lib/iov_iter.c                 |  27 ++-
 22 files changed, 724 insertions(+), 54 deletions(-)

-- 
2.30.2



* [PATCHv3 1/7] blk-mq: add ops to dma map bvec
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-05 16:24 ` [PATCHv3 2/7] file: " Keith Busch
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The same buffer may be used for many subsequent IOs. Instead of setting
up the mapping per-IO, provide an interface that allows a buffer to be
premapped just once and referenced again later.
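
For illustration, a consumer of this interface (the io_uring code later
in this series) is expected to follow roughly this pattern; sketch only:

	void *tag;

	tag = block_dma_map(bdev, bvec, nr_vecs);	/* bvec: pinned pages */
	if (IS_ERR(tag))
		return PTR_ERR(tag);	/* no ->dma_map op, or mapping failed */

	/* stash 'tag'; hand it back to the driver with every subsequent IO
	 * against this block device */

	block_dma_unmap(bdev, tag);	/* when the buffer is unregistered */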

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bdev.c           | 20 ++++++++++++++++++++
 include/linux/blk-mq.h | 13 +++++++++++++
 include/linux/blkdev.h | 16 ++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index ce05175e71ce..c3d73ad86fae 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1069,3 +1069,23 @@ void sync_bdevs(bool wait)
 	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
+
+#ifdef CONFIG_HAS_DMA
+void *block_dma_map(struct block_device *bdev, struct bio_vec *bvec,
+		    int nr_vecs)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q->mq_ops && q->mq_ops->dma_map)
+		return q->mq_ops->dma_map(q, bvec, nr_vecs);
+	return ERR_PTR(-EINVAL);
+}
+
+void block_dma_unmap(struct block_device *bdev, void *dma_tag)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q->mq_ops && q->mq_ops->dma_unmap)
+		return q->mq_ops->dma_unmap(q, dma_tag);
+}
+#endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index effee1dc715a..e10aabb36c2c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -639,6 +639,19 @@ struct blk_mq_ops {
 	 */
 	void (*show_rq)(struct seq_file *m, struct request *rq);
 #endif
+
+#ifdef CONFIG_HAS_DMA
+	/**
+	 * @dma_map: Create a dma mapping. On success, returns an opaque cookie
+	 * that can be referenced by the driver in future requests.
+	 */
+	void *(*dma_map)(struct request_queue *q, struct bio_vec *bvec, int nr_vecs);
+
+	/**
+	 * @dma_unmap: Tear down a previously created dma mapping.
+	 */
+	void (*dma_unmap)(struct request_queue *q, void *dma_tag);
+#endif
 };
 
 enum {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 49dcd31e283e..efc5e805a46e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1527,4 +1527,20 @@ struct io_comp_batch {
 
 #define DEFINE_IO_COMP_BATCH(name)	struct io_comp_batch name = { }
 
+#ifdef CONFIG_HAS_DMA
+void *block_dma_map(struct block_device *bdev, struct bio_vec *bvec,
+		    int nr_vecs);
+void block_dma_unmap(struct block_device *bdev, void *dma_tag);
+#else
+static inline void *block_dma_map(struct block_device *bdev,
+				  struct bio_vec *bvec, int nr_vecs)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void block_dma_unmap(struct block_device *bdev, void *dma_tag)
+{
+}
+#endif
+
 #endif /* _LINUX_BLKDEV_H */
-- 
2.30.2



* [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
  2022-08-05 16:24 ` [PATCHv3 1/7] blk-mq: add ops to dma map bvec Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-08  0:21   ` Dave Chinner
  2022-08-05 16:24 ` [PATCHv3 3/7] iov_iter: introduce type for preregistered dma tags Keith Busch
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

The same buffer may be used for many subsequent IOs. Instead of setting
up the mapping per-IO, provide an interface that allows a buffer to be
premapped just once and referenced again later, and implement it for
the block device file.
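
For illustration, the calling pattern through the new file operations is
roughly the following sketch (this is what the io_uring patch later in
the series does); files without a ->dma_map op return -EINVAL, and the
stubs return -ENOTSUPP when CONFIG_HAS_DMA is not set:

	void *tag = file_dma_map(file, bvec, nr_vecs);

	if (IS_ERR(tag))
		return PTR_ERR(tag);

	/* issue O_DIRECT IO that references 'tag' (see the dma_tag iov_iter
	 * added in the next patch), then eventually: */
	file_dma_unmap(file, tag);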

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/fops.c       | 20 ++++++++++++++++++++
 fs/file.c          | 15 +++++++++++++++
 include/linux/fs.h | 20 ++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/block/fops.c b/block/fops.c
index 29066ac5a2fa..db2d1e848f4b 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -670,6 +670,22 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 	return error;
 }
 
+#ifdef CONFIG_HAS_DMA
+void *blkdev_dma_map(struct file *filp, struct bio_vec *bvec, int nr_vecs)
+{
+	struct block_device *bdev = filp->private_data;
+
+	return block_dma_map(bdev, bvec, nr_vecs);
+}
+
+void blkdev_dma_unmap(struct file *filp, void *dma_tag)
+{
+	struct block_device *bdev = filp->private_data;
+
+	return block_dma_unmap(bdev, dma_tag);
+}
+#endif
+
 const struct file_operations def_blk_fops = {
 	.open		= blkdev_open,
 	.release	= blkdev_close,
@@ -686,6 +702,10 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+#ifdef CONFIG_HAS_DMA
+	.dma_map	= blkdev_dma_map,
+	.dma_unmap	= blkdev_dma_unmap,
+#endif
 };
 
 static __init int blkdev_init(void)
diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..767bf9d3205e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1307,3 +1307,18 @@ int iterate_fd(struct files_struct *files, unsigned n,
 	return res;
 }
 EXPORT_SYMBOL(iterate_fd);
+
+#ifdef CONFIG_HAS_DMA
+void *file_dma_map(struct file *file, struct bio_vec *bvec, int nr_vecs)
+{
+	if (file->f_op->dma_map)
+		return file->f_op->dma_map(file, bvec, nr_vecs);
+	return ERR_PTR(-EINVAL);
+}
+
+void file_dma_unmap(struct file *file, void *dma_tag)
+{
+	if (file->f_op->dma_unmap)
+		return file->f_op->dma_unmap(file, dma_tag);
+}
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9f131e559d05..8652bad763f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2092,6 +2092,10 @@ struct dir_context {
 struct iov_iter;
 struct io_uring_cmd;
 
+#ifdef CONFIG_HAS_DMA
+struct bio_vec;
+#endif
+
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -2134,6 +2138,10 @@ struct file_operations {
 				   loff_t len, unsigned int remap_flags);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
+#ifdef CONFIG_HAS_DMA
+	void *(*dma_map)(struct file *, struct bio_vec *, int);
+	void (*dma_unmap)(struct file *, void *);
+#endif
 } __randomize_layout;
 
 struct inode_operations {
@@ -3595,4 +3603,16 @@ extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
 			   int advice);
 
+#ifdef CONFIG_HAS_DMA
+void *file_dma_map(struct file *file, struct bio_vec *bvec, int nr_vecs);
+void file_dma_unmap(struct file *file, void *dma_tag);
+#else
+static inline void *file_dma_map(struct file *file, struct bio_vec *bvec,
+				 int nr_vecs)
+{
+	return ERR_PTR(-ENOTSUPP);
+}
+static inline void file_dma_unmap(struct file *file, void *dma_tag) {}
+#endif
+
 #endif /* _LINUX_FS_H */
-- 
2.30.2



* [PATCHv3 3/7] iov_iter: introduce type for preregistered dma tags
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
  2022-08-05 16:24 ` [PATCHv3 1/7] blk-mq: add ops to dma map bvec Keith Busch
  2022-08-05 16:24 ` [PATCHv3 2/7] file: " Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-05 16:24 ` [PATCHv3 4/7] block: add dma tag bio type Keith Busch
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Introduce a new iov_iter type representing a pre-registered DMA address
tag. The tag is an opaque cookie specific to the lower-level driver that
created it, and can be referenced at any arbitrary offset.
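
For illustration, constructing such an iterator looks roughly like the
sketch below, which is essentially what io_import_fixed() ends up doing
in the io_uring patch later in this series:

	struct iov_iter iter;

	if (imu->dma_tag && file == imu->dma_file)
		/* offset/len are relative to the registered buffer */
		iov_iter_dma_tag(&iter, ddir, imu->dma_tag, offset, len);
	else
		iov_iter_bvec(&iter, ddir, imu->bvec, imu->nr_bvecs,
			      offset + len);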

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/uio.h |  9 +++++++++
 lib/iov_iter.c      | 27 ++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 34ba4a731179..a55e4b86413a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -26,6 +26,7 @@ enum iter_type {
 	ITER_PIPE,
 	ITER_XARRAY,
 	ITER_DISCARD,
+	ITER_DMA_TAG,
 };
 
 struct iov_iter_state {
@@ -46,6 +47,7 @@ struct iov_iter {
 		const struct bio_vec *bvec;
 		struct xarray *xarray;
 		struct pipe_inode_info *pipe;
+		void *dma_tag;
 	};
 	union {
 		unsigned long nr_segs;
@@ -85,6 +87,11 @@ static inline bool iov_iter_is_bvec(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_BVEC;
 }
 
+static inline bool iov_iter_is_dma_tag(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_DMA_TAG;
+}
+
 static inline bool iov_iter_is_pipe(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_PIPE;
@@ -229,6 +236,8 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_dma_tag(struct iov_iter *i, unsigned int direction, void *dma_tag,
+			unsigned int dma_offset, size_t count);
 void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
 			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 507e732ef7cf..ebdf81473526 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1070,6 +1070,9 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
+	} else if (iov_iter_is_dma_tag(i)) {
+		i->iov_offset += size;
+		i->count -= size;
 	} else if (iov_iter_is_pipe(i)) {
 		pipe_advance(i, size);
 	} else if (unlikely(iov_iter_is_xarray(i))) {
@@ -1201,6 +1204,21 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+void iov_iter_dma_tag(struct iov_iter *i, unsigned int direction,
+			void *dma_tag, unsigned int dma_offset,
+			size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter){
+		.iter_type = ITER_DMA_TAG,
+		.data_source = direction,
+		.dma_tag = dma_tag,
+		.iov_offset = dma_offset,
+		.count = count
+	};
+}
+EXPORT_SYMBOL(iov_iter_dma_tag);
+
 void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
 			struct pipe_inode_info *pipe,
 			size_t count)
@@ -1335,6 +1353,9 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 	if (iov_iter_is_bvec(i))
 		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
 
+	if (iov_iter_is_dma_tag(i))
+		return !(i->iov_offset & addr_mask);
+
 	if (iov_iter_is_pipe(i)) {
 		unsigned int p_mask = i->pipe->ring_size - 1;
 		size_t size = i->count;
@@ -2124,8 +2145,8 @@ EXPORT_SYMBOL(import_single_range);
  */
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 {
-	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
-			 !iov_iter_is_kvec(i))
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
+			 !iov_iter_is_dma_tag(i)) && !iov_iter_is_kvec(i))
 		return;
 	i->iov_offset = state->iov_offset;
 	i->count = state->count;
@@ -2141,7 +2162,7 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 	BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec));
 	if (iov_iter_is_bvec(i))
 		i->bvec -= state->nr_segs - i->nr_segs;
-	else
+	else if (!iov_iter_is_dma_tag(i))
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
-- 
2.30.2



* [PATCHv3 4/7] block: add dma tag bio type
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
                   ` (2 preceding siblings ...)
  2022-08-05 16:24 ` [PATCHv3 3/7] iov_iter: introduce type for preregistered dma tags Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-05 16:24 ` [PATCHv3 5/7] io_uring: introduce file slot release helper Keith Busch
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Premapped buffers don't require a generic bio_vec since they have
already been dma mapped into the driver-specific data structure.
Repurpose bi_io_vec to carry the driver-specific tag, as the two are
mutually exclusive, and provide the setup and split helpers needed to
support dma tags.

In order to use this, a driver must implement the .dma_map() blk-mq op.
If the driver provides this callback, then it must be aware that any
given bio may be using a dma_tag instead of a bio_vec.

Note that this does not currently work with blk_integrity.
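
For illustration, the driver-side view of a request carrying a dma tag
is roughly the following sketch (the real consumer is the nvme-pci patch
at the end of this series):

	void *tag = blk_rq_dma_tag(rq);		/* NULL unless BIO_DMA_TAGGED */

	if (tag) {
		size_t offset = blk_rq_dma_offset(rq);

		/* (tag, offset, blk_rq_payload_bytes(rq)) fully describes
		 * the data: no bio_vec walk and no dma_map_sg() needed */
	}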

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c               | 24 +++++++++++++++++++++++-
 block/blk-merge.c         | 19 +++++++++++++++++++
 block/fops.c              |  4 +++-
 include/linux/bio.h       | 22 +++++++++++++---------
 include/linux/blk-mq.h    | 11 +++++++++++
 include/linux/blk_types.h |  6 +++++-
 6 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d6eb90d9b20b..c1e97dff5e40 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -229,7 +229,8 @@ static void bio_free(struct bio *bio)
 	WARN_ON_ONCE(!bs);
 
 	bio_uninit(bio);
-	bvec_free(&bs->bvec_pool, bio->bi_io_vec, bio->bi_max_vecs);
+	if (!bio_flagged(bio, BIO_DMA_TAGGED))
+		bvec_free(&bs->bvec_pool, bio->bi_io_vec, bio->bi_max_vecs);
 	mempool_free(p - bs->front_pad, &bs->bio_pool);
 }
 
@@ -762,6 +763,8 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio_set_flag(bio, BIO_CLONED);
 	if (bio_flagged(bio_src, BIO_THROTTLED))
 		bio_set_flag(bio, BIO_THROTTLED);
+	if (bio_flagged(bio_src, BIO_DMA_TAGGED))
+		bio_set_flag(bio, BIO_DMA_TAGGED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1151,6 +1154,19 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
+void bio_iov_dma_tag_set(struct bio *bio, struct iov_iter *iter)
+{
+	size_t size = iov_iter_count(iter);
+
+	bio->bi_vcnt = iter->nr_segs;
+	bio->bi_dma_tag = iter->dma_tag;
+	bio->bi_iter.bi_bvec_done = iter->iov_offset;
+	bio->bi_iter.bi_size = size;
+	bio->bi_opf |= REQ_NOMERGE;
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_set_flag(bio, BIO_DMA_TAGGED);
+}
+
 static int bio_iov_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int offset)
 {
@@ -1287,6 +1303,12 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
+	if (iov_iter_is_dma_tag(iter)) {
+		bio_iov_dma_tag_set(bio, iter);
+		iov_iter_advance(iter, bio->bi_iter.bi_size);
+		return 0;
+	}
+
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-merge.c b/block/blk-merge.c
index ff04e9290715..d024885ad4c4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -274,6 +274,25 @@ static struct bio *bio_split_rw(struct bio *bio, struct queue_limits *lim,
 	struct bvec_iter iter;
 	unsigned nsegs = 0, bytes = 0;
 
+	if (bio_flagged(bio, BIO_DMA_TAGGED)) {
+		int offset = offset_in_page(bio->bi_iter.bi_bvec_done);
+
+		nsegs = ALIGN(bio->bi_iter.bi_size + offset, PAGE_SIZE) >>
+								PAGE_SHIFT;
+		if (bio->bi_iter.bi_size > max_bytes) {
+			bytes = max_bytes;
+			nsegs = (bytes + offset) >> PAGE_SHIFT;
+		} else if (nsegs > lim->max_segments) {
+			nsegs = lim->max_segments;
+			bytes = PAGE_SIZE * nsegs - offset;
+		} else {
+			*segs = nsegs;
+			return NULL;
+		}
+
+		goto split;
+	}
+
 	bio_for_each_bvec(bv, bio, iter) {
 		/*
 		 * If the queue doesn't support SG gaps and adding this
diff --git a/block/fops.c b/block/fops.c
index db2d1e848f4b..1b3649c7eb17 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -325,7 +325,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		 * bio_iov_iter_get_pages() and set the bvec directly.
 		 */
 		bio_iov_bvec_set(bio, iter);
-	} else {
+	} else if (iov_iter_is_dma_tag(iter)) {
+		bio_iov_dma_tag_set(bio, iter);
+	} else {
 		ret = bio_iov_iter_get_pages(bio, iter);
 		if (unlikely(ret)) {
 			bio_put(bio);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ca22b06700a9..b5277ec189e0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -61,11 +61,17 @@ static inline bool bio_has_data(struct bio *bio)
 	return false;
 }
 
+static inline bool bio_flagged(const struct bio *bio, unsigned int bit)
+{
+	return (bio->bi_flags & (1U << bit)) != 0;
+}
+
 static inline bool bio_no_advance_iter(const struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
-	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
+	       bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+	       bio_flagged(bio, BIO_DMA_TAGGED);
 }
 
 static inline void *bio_data(struct bio *bio)
@@ -98,9 +104,11 @@ static inline void bio_advance_iter(const struct bio *bio,
 {
 	iter->bi_sector += bytes >> 9;
 
-	if (bio_no_advance_iter(bio))
+	if (bio_no_advance_iter(bio)) {
 		iter->bi_size -= bytes;
-	else
+		if (bio_flagged(bio, BIO_DMA_TAGGED))
+			iter->bi_bvec_done += bytes;
+	} else
 		bvec_iter_advance(bio->bi_io_vec, iter, bytes);
 		/* TODO: It is reasonable to complete bio with error here. */
 }
@@ -225,11 +233,6 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 	atomic_set(&bio->__bi_cnt, count);
 }
 
-static inline bool bio_flagged(struct bio *bio, unsigned int bit)
-{
-	return (bio->bi_flags & (1U << bit)) != 0;
-}
-
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)
 {
 	bio->bi_flags |= (1U << bit);
@@ -447,7 +450,7 @@ static inline void bio_wouldblock_error(struct bio *bio)
  */
 static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
 {
-	if (iov_iter_is_bvec(iter))
+	if (iov_iter_is_bvec(iter) || iov_iter_is_dma_tag(iter))
 		return 0;
 	return iov_iter_npages(iter, max_segs);
 }
@@ -471,6 +474,7 @@ void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
 void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter);
+void bio_iov_dma_tag_set(struct bio *bio, struct iov_iter *iter);
 void __bio_release_pages(struct bio *bio, bool mark_dirty);
 extern void bio_set_pages_dirty(struct bio *bio);
 extern void bio_check_pages_dirty(struct bio *bio);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e10aabb36c2c..f5e0aa61bf85 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1141,6 +1141,17 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 }
 void blk_dump_rq_flags(struct request *, char *);
 
+static inline void *blk_rq_dma_tag(struct request *rq)
+{
+	return rq->bio && bio_flagged(rq->bio, BIO_DMA_TAGGED) ?
+		rq->bio->bi_dma_tag : 0;
+}
+
+static inline size_t blk_rq_dma_offset(struct request *rq)
+{
+	return rq->bio->bi_iter.bi_bvec_done;
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline unsigned int blk_rq_zone_no(struct request *rq)
 {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1ef99790f6ed..ea6db439acbe 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -299,7 +299,10 @@ struct bio {
 
 	atomic_t		__bi_cnt;	/* pin count */
 
-	struct bio_vec		*bi_io_vec;	/* the actual vec list */
+	union {
+		struct bio_vec		*bi_io_vec;	/* the actual vec list */
+		void			*bi_dma_tag;    /* driver specific tag */
+	};
 
 	struct bio_set		*bi_pool;
 
@@ -334,6 +337,7 @@ enum {
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_DMA_TAGGED,		/* Using premapped dma buffers */
 	BIO_FLAG_LAST
 };
 
-- 
2.30.2



* [PATCHv3 5/7] io_uring: introduce file slot release helper
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
                   ` (3 preceding siblings ...)
  2022-08-05 16:24 ` [PATCHv3 4/7] block: add dma tag bio type Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-05 16:24 ` [PATCHv3 6/7] io_uring: add support for dma pre-mapping Keith Busch
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Releasing the pre-registered file follows a repeated pattern. Introduce
a helper to make it easier to add more complexity to this resource in
the future.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 io_uring/filetable.c | 33 +++++++++++++++++++++------------
 io_uring/filetable.h |  3 +++
 io_uring/rsrc.c      |  5 +----
 3 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index 7b473259f3f4..1b8db1918678 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -76,19 +76,13 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
 	file_slot = io_fixed_file_slot(&ctx->file_table, slot_index);
 
 	if (file_slot->file_ptr) {
-		struct file *old_file;
-
 		ret = io_rsrc_node_switch_start(ctx);
 		if (ret)
 			goto err;
 
-		old_file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-		ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
-					    ctx->rsrc_node, old_file);
+		ret = io_file_slot_queue_removal(ctx, file_slot);
 		if (ret)
 			goto err;
-		file_slot->file_ptr = 0;
-		io_file_bitmap_clear(&ctx->file_table, slot_index);
 		needs_switch = true;
 	}
 
@@ -148,7 +142,6 @@ int io_fixed_fd_install(struct io_kiocb *req, unsigned int issue_flags,
 int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset)
 {
 	struct io_fixed_file *file_slot;
-	struct file *file;
 	int ret;
 
 	if (unlikely(!ctx->file_data))
@@ -164,13 +157,10 @@ int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset)
 	if (!file_slot->file_ptr)
 		return -EBADF;
 
-	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-	ret = io_queue_rsrc_removal(ctx->file_data, offset, ctx->rsrc_node, file);
+	ret = io_file_slot_queue_removal(ctx, file_slot);
 	if (ret)
 		return ret;
 
-	file_slot->file_ptr = 0;
-	io_file_bitmap_clear(&ctx->file_table, offset);
 	io_rsrc_node_switch(ctx, ctx->file_data);
 	return 0;
 }
@@ -191,3 +181,22 @@ int io_register_file_alloc_range(struct io_ring_ctx *ctx,
 	io_file_table_set_alloc_range(ctx, range.off, range.len);
 	return 0;
 }
+
+int io_file_slot_queue_removal(struct io_ring_ctx *ctx,
+			       struct io_fixed_file *file_slot)
+{
+	u32 slot_index = file_slot - ctx->file_table.files;
+	struct file *file;
+	int ret;
+
+	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
+	ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
+				    ctx->rsrc_node, file);
+	if (ret)
+		return ret;
+
+	file_slot->file_ptr = 0;
+	io_file_bitmap_clear(&ctx->file_table, slot_index);
+
+	return 0;
+}
diff --git a/io_uring/filetable.h b/io_uring/filetable.h
index ff3a712e11bf..e52ecf359199 100644
--- a/io_uring/filetable.h
+++ b/io_uring/filetable.h
@@ -34,6 +34,9 @@ int io_fixed_fd_remove(struct io_ring_ctx *ctx, unsigned int offset);
 int io_register_file_alloc_range(struct io_ring_ctx *ctx,
 				 struct io_uring_file_index_range __user *arg);
 
+int io_file_slot_queue_removal(struct io_ring_ctx *ctx,
+			       struct io_fixed_file *file_slot);
+
 unsigned int io_file_get_flags(struct file *file);
 
 static inline void io_file_bitmap_clear(struct io_file_table *table, int bit)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 59704b9ac537..1f10eecad4d7 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -469,12 +469,9 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 		file_slot = io_fixed_file_slot(&ctx->file_table, i);
 
 		if (file_slot->file_ptr) {
-			file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-			err = io_queue_rsrc_removal(data, i, ctx->rsrc_node, file);
+			err = io_file_slot_queue_removal(ctx, file_slot);
 			if (err)
 				break;
-			file_slot->file_ptr = 0;
-			io_file_bitmap_clear(&ctx->file_table, i);
 			needs_switch = true;
 		}
 		if (fd != -1) {
-- 
2.30.2



* [PATCHv3 6/7] io_uring: add support for dma pre-mapping
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
                   ` (4 preceding siblings ...)
  2022-08-05 16:24 ` [PATCHv3 5/7] io_uring: introduce file slot release helper Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-05 16:24 ` [PATCHv3 7/7] nvme-pci: implement dma_map support Keith Busch
  2022-08-09  6:46 ` [PATCHv3 0/7] dma mapping optimisations Christoph Hellwig
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Provide a new register operation that requests pre-mapping a registered
buffer's bvec through the requested fixed file's implementation. If
successful, io_uring will use the returned dma tag for future fixed
buffer requests to the same file.
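
For illustration, the userspace side of the new opcodes looks roughly
like this (io_uring_register(2) written as its syscall name; in practice
it is invoked via syscall(2) or a liburing helper). The kernel side
additionally requires CAP_SYS_ADMIN, an O_DIRECT fixed file, and
registered buffers that are not already mapped to another file:

	struct io_uring_map_buffers map = {
		.fd		= 0,	/* fixed *file* index, not a raw fd */
		.buf_start	= 0,
		.buf_end	= 4,	/* maps registered buffers 0..3 */
	};

	io_uring_register(ring_fd, IORING_REGISTER_MAP_BUFFERS, &map, 1);
	/* ... fixed-buffer reads/writes to fixed file 0 ... */
	io_uring_register(ring_fd, IORING_REGISTER_UNMAP_BUFFERS, &map, 1);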

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/linux/io_uring_types.h |   2 +
 include/uapi/linux/io_uring.h  |  12 +++
 io_uring/filetable.c           |   7 +-
 io_uring/filetable.h           |   7 +-
 io_uring/io_uring.c            | 139 +++++++++++++++++++++++++++++++++
 io_uring/net.c                 |   2 +-
 io_uring/rsrc.c                |  22 ++++--
 io_uring/rsrc.h                |  10 ++-
 io_uring/rw.c                  |   2 +-
 9 files changed, 188 insertions(+), 15 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f7fab3758cb9..f62ea17cc480 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -23,6 +23,8 @@ struct io_wq_work {
 	int cancel_seq;
 };
 
+struct io_mapped_ubuf;
+
 struct io_fixed_file {
 	/* file * with additional FFS_* flags */
 	unsigned long file_ptr;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1463cfecb56b..daacbe899d1d 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -485,6 +485,10 @@ enum {
 	IORING_REGISTER_NOTIFIERS		= 26,
 	IORING_UNREGISTER_NOTIFIERS		= 27,
 
+	/* dma map registered buffers */
+	IORING_REGISTER_MAP_BUFFERS		= 28,
+	IORING_REGISTER_UNMAP_BUFFERS		= 29,
+
 	/* this goes last */
 	IORING_REGISTER_LAST
 };
@@ -661,4 +665,12 @@ struct io_uring_recvmsg_out {
 	__u32 flags;
 };
 
+struct io_uring_map_buffers {
+	__s32	fd;
+	__s32	buf_start;
+	__s32	buf_end;
+	__u32	flags;
+	__u64	rsvd[2];
+};
+
 #endif
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index 1b8db1918678..5ca2f27f317f 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -189,9 +189,10 @@ int io_file_slot_queue_removal(struct io_ring_ctx *ctx,
 	struct file *file;
 	int ret;
 
-	file = (struct file *)(file_slot->file_ptr & FFS_MASK);
-	ret = io_queue_rsrc_removal(ctx->file_data, slot_index,
-				    ctx->rsrc_node, file);
+	file = io_file_from_fixed(file_slot);
+	io_dma_unmap_file(ctx, file_slot);
+	ret = io_queue_rsrc_removal(ctx->file_data, slot_index, ctx->rsrc_node,
+				    file);
 	if (ret)
 		return ret;
 
diff --git a/io_uring/filetable.h b/io_uring/filetable.h
index e52ecf359199..3b2aae5bff76 100644
--- a/io_uring/filetable.h
+++ b/io_uring/filetable.h
@@ -58,12 +58,17 @@ io_fixed_file_slot(struct io_file_table *table, unsigned i)
 	return &table->files[i];
 }
 
+static inline struct file *io_file_from_fixed(struct io_fixed_file *f)
+{
+	return (struct file *) (f->file_ptr & FFS_MASK);
+}
+
 static inline struct file *io_file_from_index(struct io_file_table *table,
 					      int index)
 {
 	struct io_fixed_file *slot = io_fixed_file_slot(table, index);
 
-	return (struct file *) (slot->file_ptr & FFS_MASK);
+	return io_file_from_fixed(slot);
 }
 
 static inline void io_fixed_file_set(struct io_fixed_file *file_slot,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b54218da075c..7b44395877a3 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3681,6 +3681,133 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
 	return ret;
 }
 
+static int get_map_range(struct io_ring_ctx *ctx,
+			 struct io_uring_map_buffers *map, void __user *arg)
+{
+	int ret;
+
+	if (copy_from_user(map, arg, sizeof(*map)))
+		return -EFAULT;
+	if (map->flags || map->rsvd[0] || map->rsvd[1])
+		return -EINVAL;
+	if (map->fd >= ctx->nr_user_files)
+		return -EINVAL;
+	if (map->buf_start < 0)
+		return -EINVAL;
+	if (map->buf_start >= ctx->nr_user_bufs)
+		return -EINVAL;
+	if (map->buf_end > ctx->nr_user_bufs)
+		map->buf_end = ctx->nr_user_bufs;
+
+	ret = map->buf_end - map->buf_start;
+	if (ret <= 0)
+		return -EINVAL;
+
+	return ret;
+}
+
+void io_dma_unmap(struct io_mapped_ubuf *imu)
+{
+	if (!imu->dma_tag)
+		return;
+
+	file_dma_unmap(imu->dma_file, imu->dma_tag);
+	imu->dma_file = NULL;
+	imu->dma_tag = NULL;
+}
+
+void io_dma_unmap_file(struct io_ring_ctx *ctx, struct io_fixed_file *file_slot)
+{
+	struct file *file;
+	int i;
+
+	if (!file_slot->file_ptr)
+		return;
+
+	file = io_file_from_fixed(file_slot);
+	for (i = 0; i < ctx->nr_user_bufs; i++) {
+		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
+
+		if (!imu->dma_tag)
+			continue;
+
+		if (imu->dma_file == file)
+			io_dma_unmap(imu);
+	}
+}
+
+static int io_register_unmap_buffers(struct io_ring_ctx *ctx, void __user *arg)
+{
+	struct io_uring_map_buffers map;
+	int i, ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	ret = get_map_range(ctx, &map, arg);
+	if (ret < 0)
+		return ret;
+
+	for (i = map.buf_start; i < map.buf_end; i++) {
+		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
+
+		io_dma_unmap(imu);
+	}
+
+	return 0;
+}
+
+static int io_register_map_buffers(struct io_ring_ctx *ctx, void __user *arg)
+{
+	struct io_uring_map_buffers map;
+	struct io_fixed_file *file_slot;
+	struct file *file;
+	int ret, i;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = get_map_range(ctx, &map, arg);
+	if (ret < 0)
+		return ret;
+
+	file_slot = io_fixed_file_slot(&ctx->file_table,
+			array_index_nospec(map.fd, ctx->nr_user_files));
+	if (!file_slot->file_ptr)
+		return -EBADF;
+
+	file = io_file_from_fixed(file_slot);
+	if (!(file->f_flags & O_DIRECT))
+		return -EOPNOTSUPP;
+
+	for (i = map.buf_start; i < map.buf_end; i++) {
+		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
+		void *tag;
+
+		if (imu->dma_tag) {
+			ret = -EBUSY;
+			goto err;
+		}
+
+		tag = file_dma_map(file, imu->bvec, imu->nr_bvecs);
+		if (IS_ERR(tag)) {
+			ret = PTR_ERR(tag);
+			goto err;
+		}
+
+		imu->dma_tag = tag;
+		imu->dma_file = file;
+	}
+
+	return 0;
+err:
+	while (--i >= map.buf_start) {
+		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
+
+		io_dma_unmap(imu);
+	}
+	return ret;
+}
+
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
 	__releases(ctx->uring_lock)
@@ -3847,6 +3974,18 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_notif_unregister(ctx);
 		break;
+	case IORING_REGISTER_MAP_BUFFERS:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_map_buffers(ctx, arg);
+		break;
+	case IORING_REGISTER_UNMAP_BUFFERS:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_unmap_buffers(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/net.c b/io_uring/net.c
index 32fc3da04e41..2793fd7d99d5 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -977,7 +977,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
 		ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
-					(u64)(uintptr_t)zc->buf, zc->len);
+					(u64)(uintptr_t)zc->buf, zc->len, NULL);
 		if (unlikely(ret))
 				return ret;
 	} else {
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1f10eecad4d7..646e29bb904c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -148,6 +148,7 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
 			unpin_user_page(imu->bvec[i].bv_page);
 		if (imu->acct_pages)
 			io_unaccount_mem(ctx, imu->acct_pages);
+		io_dma_unmap(imu);
 		kvfree(imu);
 	}
 	*slot = NULL;
@@ -809,12 +810,17 @@ void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
 	int i;
 
 	for (i = 0; i < ctx->nr_user_files; i++) {
-		struct file *file = io_file_from_index(&ctx->file_table, i);
+		struct io_fixed_file *file_slot;
+		struct file *file;
 
-		if (!file)
+		file_slot = io_fixed_file_slot(&ctx->file_table, i);
+		if (!file_slot->file_ptr)
 			continue;
-		if (io_fixed_file_slot(&ctx->file_table, i)->file_ptr & FFS_SCM)
+		if (file_slot->file_ptr & FFS_SCM)
 			continue;
+
+		file = io_file_from_fixed(file_slot);
+		io_dma_unmap_file(ctx, file_slot);
 		io_file_bitmap_clear(&ctx->file_table, i);
 		fput(file);
 	}
@@ -1282,6 +1288,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	imu->ubuf = (unsigned long) iov->iov_base;
 	imu->ubuf_end = imu->ubuf + iov->iov_len;
 	imu->nr_bvecs = nr_pages;
+	imu->dma_tag = NULL;
 	*pimu = imu;
 	ret = 0;
 done:
@@ -1356,9 +1363,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len)
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu,
+		    u64 buf_addr, size_t len, struct file *file)
 {
 	u64 buf_end;
 	size_t offset;
@@ -1376,6 +1382,10 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	 * and advance us to the beginning.
 	 */
 	offset = buf_addr - imu->ubuf;
+	if (imu->dma_tag && file == imu->dma_file) {
+		iov_iter_dma_tag(iter, ddir, imu->dma_tag, offset, len);
+		return 0;
+	}
 	iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
 
 	if (offset) {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f3a9a177941f..ed3637d6a55d 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,6 +50,8 @@ struct io_mapped_ubuf {
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
 	unsigned long	acct_pages;
+	void		*dma_tag;
+	struct file	*dma_file;
 	struct bio_vec	bvec[];
 };
 
@@ -64,9 +66,11 @@ int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
 void io_rsrc_node_switch(struct io_ring_ctx *ctx,
 			 struct io_rsrc_data *data_to_kill);
 
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len);
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu,
+		    u64 buf_addr, size_t len, struct file *file);
+
+void io_dma_unmap(struct io_mapped_ubuf *imu);
+void io_dma_unmap_file(struct io_ring_ctx *ctx, struct io_fixed_file *file_slot);
 
 void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 2b784795103c..9e2164d09adb 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -359,7 +359,7 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	ssize_t ret;
 
 	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
-		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
+		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len, req->file);
 		if (ret)
 			return ERR_PTR(ret);
 		return NULL;
-- 
2.30.2



* [PATCHv3 7/7] nvme-pci: implement dma_map support
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
                   ` (5 preceding siblings ...)
  2022-08-05 16:24 ` [PATCHv3 6/7] io_uring: add support for dma pre-mapping Keith Busch
@ 2022-08-05 16:24 ` Keith Busch
  2022-08-09  6:46 ` [PATCHv3 0/7] dma mapping optimisations Christoph Hellwig
  7 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-05 16:24 UTC (permalink / raw)
  To: linux-nvme, linux-block, io-uring, linux-fsdevel
  Cc: axboe, hch, Alexander Viro, Kernel Team, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Implement callbacks to convert a registered bio_vec to a prp list, and
use this for each IO that uses the returned tag. This saves the repeated
per-IO conversions and dma mapping/unmapping. In many cases, the driver
can skip per-IO pool allocations entirely, potentially saving
significant CPU cycles.
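
For illustration, the per-IO fast path this enables reduces to indexing
the PRP array that was built once at registration time, roughly:

	/* endian conversions omitted for brevity */
	offset = blk_rq_dma_offset(req) + mapping->offset;
	i = offset >> NVME_CTRL_PAGE_SHIFT;
	prp1 = mapping->prps[i] + (offset & (NVME_CTRL_PAGE_SIZE - 1));
	/* when more data follows and the remaining PRP entries fit in one
	 * controller page, the pre-built list is referenced in place: */
	prp2 = mapping->prp_dma_addr + 8 * (i + 1);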

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/pci.c | 314 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 303 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71a4f26ba476..d42b00c6e041 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -104,12 +104,23 @@ static bool noacpi;
 module_param(noacpi, bool, 0444);
 MODULE_PARM_DESC(noacpi, "disable acpi bios quirks");
 
+static const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
+
 struct nvme_dev;
 struct nvme_queue;
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static bool __nvme_disable_io_queues(struct nvme_dev *dev, u8 opcode);
 
+struct nvme_dma_mapping {
+	int nr_pages;
+	u16 offset;
+	bool needs_sync;
+	u8  rsvd;
+	dma_addr_t prp_dma_addr;
+	__le64 *prps;
+};
+
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
  */
@@ -544,9 +555,30 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
 	return true;
 }
 
+static void nvme_sync_dma(struct nvme_dev *dev, struct request *req,
+			  struct nvme_dma_mapping *mapping)
+{
+	int offset, i, j, length, nprps;
+
+	offset = blk_rq_dma_offset(req) + mapping->offset;
+	i = offset >> NVME_CTRL_PAGE_SHIFT;
+
+	offset = offset & (NVME_CTRL_PAGE_SIZE - 1);
+	length = blk_rq_payload_bytes(req) - (NVME_CTRL_PAGE_SIZE - offset);
+	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
+
+	dma_sync_single_for_cpu(dev->dev,
+		le64_to_cpu(mapping->prps[i++]),
+		NVME_CTRL_PAGE_SIZE - offset, DMA_FROM_DEVICE);
+	for (j = 1; j < nprps; j++) {
+		dma_sync_single_for_cpu(dev->dev,
+			le64_to_cpu(mapping->prps[i++]),
+			NVME_CTRL_PAGE_SIZE, DMA_FROM_DEVICE);
+	}
+}
+
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 {
-	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	dma_addr_t dma_addr = iod->first_dma;
 	int i;
@@ -576,10 +608,24 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req)
 	}
 }
 
+static void nvme_free_prp_chain(struct nvme_dev *dev, struct request *req,
+				struct nvme_iod *iod)
+{
+	if (iod->npages == 0)
+		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
+			      iod->first_dma);
+	else if (iod->use_sgl)
+		nvme_free_sgls(dev, req);
+	else
+		nvme_free_prps(dev, req);
+	mempool_free(iod->sg, dev->iod_mempool);
+}
+
 static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
+	WARN_ON_ONCE(!iod->nents);
 	if (is_pci_p2pdma_page(sg_page(iod->sg)))
 		pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
 				    rq_dma_dir(req));
@@ -589,25 +635,25 @@ static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
 
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
+	struct nvme_dma_mapping *mapping = blk_rq_dma_tag(req);
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 
+	if (mapping) {
+		if (mapping->needs_sync && rq_data_dir(req) == READ)
+			nvme_sync_dma(dev, req, mapping);
+		if (iod->npages >= 0)
+			nvme_free_prp_chain(dev, req, iod);
+		return;
+	}
+
 	if (iod->dma_len) {
 		dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
 			       rq_dma_dir(req));
 		return;
 	}
 
-	WARN_ON_ONCE(!iod->nents);
-
 	nvme_unmap_sg(dev, req);
-	if (iod->npages == 0)
-		dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
-			      iod->first_dma);
-	else if (iod->use_sgl)
-		nvme_free_sgls(dev, req);
-	else
-		nvme_free_prps(dev, req);
-	mempool_free(iod->sg, dev->iod_mempool);
+	nvme_free_prp_chain(dev, req, iod);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -835,13 +881,145 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 	return BLK_STS_OK;
 }
 
+static blk_status_t nvme_premapped_slow(struct nvme_dev *dev,
+				struct request *req,  struct nvme_iod *iod,
+				struct nvme_dma_mapping *mapping, int nprps)
+{
+	struct dma_pool *pool;
+	dma_addr_t prp_dma;
+	__le64 *prp_list;
+	void **list;
+	int i;
+
+	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
+	if (!iod->sg)
+		return BLK_STS_RESOURCE;
+
+	if (nprps <= (256 / 8)) {
+		pool = dev->prp_small_pool;
+		iod->npages = 0;
+	} else {
+		pool = dev->prp_page_pool;
+		iod->npages = 1;
+	}
+
+	prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
+	if (!prp_list) {
+		iod->npages = -1;
+		goto out_free_sg;
+	}
+
+	list = nvme_pci_iod_list(req);
+	list[0] = prp_list;
+	iod->first_dma = prp_dma;
+
+	for (;;) {
+		dma_addr_t next_prp_dma;
+		__le64 *next_prp_list;
+
+		if (nprps <= last_prp + 1) {
+			memcpy(prp_list, &mapping->prps[i], nprps * 8);
+			break;
+		}
+
+		memcpy(prp_list, &mapping->prps[i], NVME_CTRL_PAGE_SIZE - 8);
+		nprps -= last_prp;
+		i += last_prp;
+
+		next_prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &next_prp_dma);
+		if (!next_prp_list)
+			goto free_prps;
+
+		prp_list[last_prp] = cpu_to_le64(next_prp_dma);
+		prp_list = next_prp_list;
+		prp_dma = next_prp_dma;
+		list[iod->npages++] = prp_list;
+	}
+	return BLK_STS_OK;
+
+free_prps:
+	nvme_free_prps(dev, req);
+out_free_sg:
+	mempool_free(iod->sg, dev->iod_mempool);
+	return BLK_STS_RESOURCE;
+}
+
+static blk_status_t nvme_premapped(struct nvme_dev *dev, struct request *req,
+				   struct nvme_dma_mapping *mapping,
+				   struct nvme_rw_command *cmnd,
+				   struct nvme_iod *iod)
+{
+	bool needs_sync = mapping->needs_sync && rq_data_dir(req) == WRITE;
+	dma_addr_t prp_list_start, prp_list_end;
+	int i, offset, j, length, nprps;
+	blk_status_t ret;
+
+	offset = blk_rq_dma_offset(req) + mapping->offset;
+	i = offset >> NVME_CTRL_PAGE_SHIFT;
+
+	if (needs_sync)
+		dma_sync_single_for_device(dev->dev,
+			le64_to_cpu(mapping->prps[i]),
+			NVME_CTRL_PAGE_SIZE - offset, DMA_TO_DEVICE);
+
+	offset = offset & (NVME_CTRL_PAGE_SIZE - 1);
+	cmnd->dptr.prp1 = cpu_to_le64(le64_to_cpu(mapping->prps[i++]) + offset);
+
+	length = blk_rq_payload_bytes(req) - (NVME_CTRL_PAGE_SIZE - offset);
+	if (length <= 0)
+		return BLK_STS_OK;
+
+	if (length <= NVME_CTRL_PAGE_SIZE) {
+		if (needs_sync)
+			dma_sync_single_for_device(dev->dev,
+				le64_to_cpu(mapping->prps[i]),
+				NVME_CTRL_PAGE_SIZE, DMA_TO_DEVICE);
+		cmnd->dptr.prp2 = mapping->prps[i];
+		return BLK_STS_OK;
+	}
+
+	nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
+	prp_list_start = mapping->prp_dma_addr + 8 * i;
+	prp_list_end = prp_list_start + 8 * nprps;
+
+	/* Optimization when remaining list fits in one nvme page */
+	if ((prp_list_start >> NVME_CTRL_PAGE_SHIFT) ==
+	    (prp_list_end >> NVME_CTRL_PAGE_SHIFT)) {
+		cmnd->dptr.prp2 = cpu_to_le64(prp_list_start);
+		goto sync;
+	}
+
+	ret = nvme_premapped_slow(dev, req, iod, mapping, nprps);
+	if (ret != BLK_STS_OK)
+		return ret;
+
+	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
+sync:
+	if (!needs_sync)
+		return BLK_STS_OK;
+
+	i = offset >> NVME_CTRL_PAGE_SHIFT;
+	for (j = 0; j < nprps; j++)
+		dma_sync_single_for_device(dev->dev,
+			le64_to_cpu(mapping->prps[i++]),
+			NVME_CTRL_PAGE_SIZE, DMA_TO_DEVICE);
+	return BLK_STS_OK;
+}
+
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_dma_mapping *mapping = blk_rq_dma_tag(req);
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int nr_mapped;
 
+	if (mapping) {
+		iod->dma_len = 0;
+		iod->use_sgl = false;
+		return nvme_premapped(dev, req, mapping, &cmnd->rw, iod);
+	}
+
 	if (blk_rq_nr_phys_segments(req) == 1) {
 		struct bio_vec bv = req_bvec(req);
 
@@ -1732,6 +1910,116 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
 	return result;
 }
 
+#ifdef CONFIG_HAS_DMA
+/*
+ * Important: bvec must be describing a virtually contiguous buffer.
+ */
+static void *nvme_pci_dma_map(struct request_queue *q,
+			       struct bio_vec *bvec, int nr_vecs)
+{
+	const int nvme_pages = 1 << (PAGE_SHIFT - NVME_CTRL_PAGE_SHIFT);
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+	struct nvme_dma_mapping *mapping;
+	int i, j, k, size, ppv, ret = -ENOMEM;
+
+	if (!nr_vecs)
+		return ERR_PTR(-EINVAL);
+
+	mapping = kzalloc(sizeof(*mapping), GFP_KERNEL);
+	if (!mapping)
+		return ERR_PTR(-ENOMEM);
+
+	mapping->nr_pages = nr_vecs * nvme_pages;
+	size = sizeof(*mapping->prps) * mapping->nr_pages;
+	mapping->prps = dma_alloc_coherent(dev->dev, size,
+				&mapping->prp_dma_addr, GFP_KERNEL);
+	if (!mapping->prps)
+		goto free_mapping;
+
+	mapping->needs_sync = false;
+	for (i = 0, k = 0; i < nr_vecs; i++) {
+		struct bio_vec *bv = bvec + i;
+		dma_addr_t dma_addr;
+
+		ppv = nvme_pages;
+		if (i == 0) {
+			mapping->offset = bv->bv_offset;
+			ppv -= mapping->offset >> NVME_CTRL_PAGE_SHIFT;
+		} else if (bv->bv_offset) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		if (bv->bv_offset + bv->bv_len != PAGE_SIZE &&
+		    i < nr_vecs - 1) {
+			ret = -EINVAL;
+			goto err;
+		}
+
+		dma_addr = dma_map_bvec(dev->dev, bv, 0, 0);
+		if (dma_mapping_error(dev->dev, dma_addr)) {
+			ret = -EIO;
+			goto err;
+		}
+
+		if (i == 0)
+			dma_addr -= mapping->offset;
+
+		if (dma_need_sync(dev->dev, dma_addr))
+			mapping->needs_sync = true;
+
+		for (j = 0; j < ppv; j++)
+			mapping->prps[k++] = cpu_to_le64(dma_addr +
+						j * NVME_CTRL_PAGE_SIZE);
+	}
+
+	get_device(dev->dev);
+	return mapping;
+
+err:
+	for (i = 0; i < k; i += ppv) {
+		__u64 dma_addr = le64_to_cpu(mapping->prps[i]);
+		ppv = nvme_pages;
+
+		if (i == 0)
+			ppv -= mapping->offset >> NVME_CTRL_PAGE_SHIFT;
+		dma_unmap_page(dev->dev, dma_addr,
+			       PAGE_SIZE - offset_in_page(dma_addr), 0);
+	}
+
+	dma_free_coherent(dev->dev, size, (void *)mapping->prps,
+			  mapping->prp_dma_addr);
+free_mapping:
+	kfree(mapping);
+	return ERR_PTR(ret);
+}
+
+static void nvme_pci_dma_unmap(struct request_queue *q, void *dma_tag)
+{
+	const int nvme_pages = 1 << (PAGE_SHIFT - NVME_CTRL_PAGE_SHIFT);
+	struct nvme_ns *ns = q->queuedata;
+	struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+	struct nvme_dma_mapping *mapping = dma_tag;
+	int i, ppv;
+
+	for (i = 0; i < mapping->nr_pages; i += ppv) {
+		__u64 dma_addr = le64_to_cpu(mapping->prps[i]);
+		ppv = nvme_pages;
+
+		if (i == 0)
+			ppv -= mapping->offset >> NVME_CTRL_PAGE_SHIFT;
+		dma_unmap_page(dev->dev, dma_addr,
+			       PAGE_SIZE - offset_in_page(dma_addr), 0);
+	}
+
+	dma_free_coherent(dev->dev, mapping->nr_pages * sizeof(*mapping->prps),
+			  (void *)mapping->prps, mapping->prp_dma_addr);
+	kfree(mapping);
+	put_device(dev->dev);
+}
+#endif
+
 static const struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_pci_complete_rq,
@@ -1750,6 +2038,10 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.map_queues	= nvme_pci_map_queues,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
+#ifdef CONFIG_HAS_DMA
+	.dma_map	= nvme_pci_dma_map,
+	.dma_unmap	= nvme_pci_dma_unmap,
+#endif
 };
 
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
-- 
2.30.2



* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-05 16:24 ` [PATCHv3 2/7] file: " Keith Busch
@ 2022-08-08  0:21   ` Dave Chinner
  2022-08-08  1:13     ` Matthew Wilcox
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2022-08-08  0:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, io-uring, linux-fsdevel, axboe, hch,
	Alexander Viro, Kernel Team, Keith Busch

On Fri, Aug 05, 2022 at 09:24:39AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> The same buffer may be used for many subsequent IOs. Instead of setting
> up the mapping per-IO, provide an interface that allows a buffer to be
> premapped just once and referenced again later, and implement it for
> the block device file.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/fops.c       | 20 ++++++++++++++++++++
>  fs/file.c          | 15 +++++++++++++++
>  include/linux/fs.h | 20 ++++++++++++++++++++
>  3 files changed, 55 insertions(+)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 29066ac5a2fa..db2d1e848f4b 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -670,6 +670,22 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  	return error;
>  }
>  
> +#ifdef CONFIG_HAS_DMA
> +void *blkdev_dma_map(struct file *filp, struct bio_vec *bvec, int nr_vecs)
> +{
> +	struct block_device *bdev = filp->private_data;
> +
> +	return block_dma_map(bdev, bvec, nr_vecs);
> +}
> +
> +void blkdev_dma_unmap(struct file *filp, void *dma_tag)
> +{
> +	struct block_device *bdev = filp->private_data;
> +
> +	return block_dma_unmap(bdev, dma_tag);
> +}
> +#endif
> +
>  const struct file_operations def_blk_fops = {
>  	.open		= blkdev_open,
>  	.release	= blkdev_close,
> @@ -686,6 +702,10 @@ const struct file_operations def_blk_fops = {
>  	.splice_read	= generic_file_splice_read,
>  	.splice_write	= iter_file_splice_write,
>  	.fallocate	= blkdev_fallocate,
> +#ifdef CONFIG_HAS_DMA
> +	.dma_map	= blkdev_dma_map,
> +	.dma_unmap	= blkdev_dma_unmap,
> +#endif
>  };
>  
>  static __init int blkdev_init(void)
> diff --git a/fs/file.c b/fs/file.c
> index 3bcc1ecc314a..767bf9d3205e 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -1307,3 +1307,18 @@ int iterate_fd(struct files_struct *files, unsigned n,
>  	return res;
>  }
>  EXPORT_SYMBOL(iterate_fd);
> +
> +#ifdef CONFIG_HAS_DMA
> +void *file_dma_map(struct file *file, struct bio_vec *bvec, int nr_vecs)
> +{
> +	if (file->f_op->dma_map)
> +		return file->f_op->dma_map(file, bvec, nr_vecs);
> +	return ERR_PTR(-EINVAL);
> +}
> +
> +void file_dma_unmap(struct file *file, void *dma_tag)
> +{
> +	if (file->f_op->dma_unmap)
> +		return file->f_op->dma_unmap(file, dma_tag);
> +}
> +#endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9f131e559d05..8652bad763f3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2092,6 +2092,10 @@ struct dir_context {
>  struct iov_iter;
>  struct io_uring_cmd;
>  
> +#ifdef CONFIG_HAS_DMA
> +struct bio_vec;
> +#endif
> +
>  struct file_operations {
>  	struct module *owner;
>  	loff_t (*llseek) (struct file *, loff_t, int);
> @@ -2134,6 +2138,10 @@ struct file_operations {
>  				   loff_t len, unsigned int remap_flags);
>  	int (*fadvise)(struct file *, loff_t, loff_t, int);
>  	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
> +#ifdef CONFIG_HAS_DMA
> +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> +	void (*dma_unmap)(struct file *, void *);
> +#endif
>  } __randomize_layout;

This just smells wrong. Using a block layer specific construct as a
primary file operation parameter shouts "layering violation" to me.

Indeed, I can't see how this can be used by anything other than a
block device file on a single, stand-alone block device. It's
mapping a region of memory to something that has no file offset or
length associated with it, and the implementation of the callout is
specially pulling the bdev from the private file data.

What we really need is a callout that returns the bdevs that the
struct file is mapped to (one, or many), so the caller can then map
the memory addresses to the block devices itself. The caller then
needs to do an {file, offset, len} -> {bdev, sector, count}
translation so the io_uring code can then use the correct bdev and
dma mappings for the file offset that the user is doing IO to/from.

For a stand-alone block device, the "get bdevs" callout is pretty
simple. single device filesystems are trivial, too. XFS is trivial -
it will return 1 or 2 block devices. stacked bdevs need to iterate
recursively, as would filesystems like btrfs. Still, pretty easy,
and for the case you care about here has almost zero overhead.

Now you have a list of all the bdevs you are going to need to add
dma mappings for, and you can call the bdev directly to set them up.
There is no need whatsoever to do this through the file
operations layer - it's completely contained at the block device
layer and below.

Then, for each file IO range, we need a mapping callout in the file
operations structure. That will take a  {file, offset, len} tuple
and return a {bdev, sector, count} tuple that maps part or all of
the file data.

Again, for a standalone block device, this is simply a translation
of filp->private_data to the bdev, and of offset/len from byte counts
to sector counts. Trivial, almost no overhead at all.
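
As a rough sketch of that translation (the callout name and the mapping
struct here are hypothetical, nothing the patch series defines), the
standalone block device case could be as small as:

#include <linux/blkdev.h>
#include <linux/fs.h>

struct file_bdev_map {
        struct block_device *bdev;
        sector_t sector;
        sector_t count;
};

/*
 * Hypothetical callout: translate a {file, offset, len} range into the
 * {bdev, sector, count} tuple backing it.  For def_blk_fops the whole
 * file lives on one bdev, so it is a plain byte-to-sector conversion.
 */
static int blkdev_map_range(struct file *filp, loff_t offset, loff_t len,
                            struct file_bdev_map *map)
{
        map->bdev = filp->private_data;
        map->sector = offset >> SECTOR_SHIFT;
        map->count = len >> SECTOR_SHIFT;
        return 0;
}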

For filesystems and stacked block devices, though, this gives you
back all the information you need to select the right set of dma
buffers and the {sector, count} information you need to issue the IO
correctly. Setting this up is now all block device layer
manipulation.[*]

This is where I think this patchset needs to go, not bulldoze
through abstractions that get in the way because all you are
implementing is a special fast path for a single niche use case. We
know how to make it work with filesystems and stacked devices, so
can we please start with an API that allows us to implement the
functionality without having to completely rewrite all the code that
you are proposing to add right now?

Cheers,

Dave.

[*] For the purposes of brevity, I'm ignoring the elephant in the
middle of the room: how do you ensure that the filesystem doesn't
run a truncate or hole punch while you have an outstanding DMA
mapping and io_uring is doing IO direct to file offset via that
mapping? i.e. how do you prevent such a non-filesystem-controlled IO
path from accessing stale data (i.e. use-after-free) of on-disk
storage because there is nothing serialising it against other
filesystem operations?

This is very similar to the direct storage access issues that
DAX+RDMA and pNFS file layouts have. pNFS solves it with file layout
leases, and DAX has all the hooks into the filesystems needed to use
file layout leases in place, too. I'd suggest that IO via io_uring
persistent DMA mappings outside the scope of the filesysetm
controlled IO path also need layout lease guarantees to avoid user
IO from racing with truncate, etc....

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  0:21   ` Dave Chinner
@ 2022-08-08  1:13     ` Matthew Wilcox
  2022-08-08  2:15       ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2022-08-08  1:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, hch, Alexander Viro, Kernel Team, Keith Busch

On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > +#ifdef CONFIG_HAS_DMA
> > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > +	void (*dma_unmap)(struct file *, void *);
> > +#endif
> 
> This just smells wrong. Using a block layer specific construct as a
> primary file operation parameter shouts "layering violation" to me.

A bio_vec is also used for networking; it's in disguise as an skb_frag,
but it's there.

> What we really need is a callout that returns the bdevs that the
> struct file is mapped to (one, or many), so the caller can then map
> the memory addresses to the block devices itself. The caller then
> needs to do an {file, offset, len} -> {bdev, sector, count}
> translation so the io_uring code can then use the correct bdev and
> dma mappings for the file offset that the user is doing IO to/from.

I don't even know if what you're proposing is possible.  Consider a
network filesystem which might transparently be moved from one network
interface to another.  I don't even know if the filesystem would know
which network device is going to be used for the IO at the time of
IO submission.

I think a totally different model is needed where we can find out if
the bvec contains pages which are already mapped to the device, and map
them if they aren't.  That also handles a DM case where extra devices
are hot-added to a RAID, for example.
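
A minimal sketch of what I mean (the cache type and its lookup/insert
helpers are hypothetical, e.g. an rhashtable or xarray keyed by device;
only the dma_map_page()/dma_mapping_error() calls are real API):

#include <linux/dma-mapping.h>

static dma_addr_t dma_map_page_cached(struct device *dev,
                                      struct dma_map_cache *cache,
                                      struct page *page,
                                      enum dma_data_direction dir)
{
        /* hypothetical helper: do we already have a mapping for this device? */
        dma_addr_t addr = dma_map_cache_lookup(cache, dev, page);

        if (addr != DMA_MAPPING_ERROR)
                return addr;            /* already mapped to this device */

        addr = dma_map_page(dev, page, 0, PAGE_SIZE, dir);
        if (dma_mapping_error(dev, addr))
                return DMA_MAPPING_ERROR;

        /* hypothetical helper: remember the mapping for reuse */
        dma_map_cache_insert(cache, dev, page, addr);
        return addr;
}

A device that gets hot-added to the RAID simply takes the mapping path
the first time a page is used with it.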

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  1:13     ` Matthew Wilcox
@ 2022-08-08  2:15       ` Dave Chinner
  2022-08-08  2:49         ` Matthew Wilcox
  2022-08-08 10:14         ` Pavel Begunkov
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2022-08-08  2:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, hch, Alexander Viro, Kernel Team, Keith Busch

On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > +#ifdef CONFIG_HAS_DMA
> > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > +	void (*dma_unmap)(struct file *, void *);
> > > +#endif
> > 
> > This just smells wrong. Using a block layer specific construct as a
> > primary file operation parameter shouts "layering violation" to me.
> 
> A bio_vec is also used for networking; it's in disguise as an skb_frag,
> but it's there.

Which is just as awful. Just because it's done somewhere else
doesn't make it right.

> > What we really need is a callout that returns the bdevs that the
> > struct file is mapped to (one, or many), so the caller can then map
> > the memory addresses to the block devices itself. The caller then
> > needs to do an {file, offset, len} -> {bdev, sector, count}
> > translation so the io_uring code can then use the correct bdev and
> > dma mappings for the file offset that the user is doing IO to/from.
> 
> I don't even know if what you're proposing is possible.  Consider a
> network filesystem which might transparently be moved from one network
> interface to another.  I don't even know if the filesystem would know
> which network device is going to be used for the IO at the time of
> IO submission.

Sure, but nobody is suggesting we support direct DMA buffer mapping
and reuse for network devices right now, whereas we have working
code for block devices in front of us.

What I want to see is broad-based generic block device based
filesystem support, not niche functionality that can only work on a
single type of block device. Network filesystems and devices are a
*long* way from being able to do anything like this, so I don't see
a need to cater for them at this point in time.

When someone has a network device abstraction and network filesystem
that can do direct data placement based on that device abstraction,
then we can talk about the high level interface we should use to
drive it....

> I think a totally different model is needed where we can find out if
> the bvec contains pages which are already mapped to the device, and map
> them if they aren't.  That also handles a DM case where extra devices
> are hot-added to a RAID, for example.

I cannot form a picture of what you are suggesting from such a brief
description. Care to explain in more detail?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  2:15       ` Dave Chinner
@ 2022-08-08  2:49         ` Matthew Wilcox
  2022-08-08  7:31           ` Dave Chinner
  2022-08-08 10:14         ` Pavel Begunkov
  1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2022-08-08  2:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, hch, Alexander Viro, Kernel Team, Keith Busch

On Mon, Aug 08, 2022 at 12:15:01PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> > On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > > +#ifdef CONFIG_HAS_DMA
> > > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > > +	void (*dma_unmap)(struct file *, void *);
> > > > +#endif
> > > 
> > > This just smells wrong. Using a block layer specific construct as a
> > > primary file operation parameter shouts "layering violation" to me.
> > 
> > A bio_vec is also used for networking; it's in disguise as an skb_frag,
> > but it's there.
> 
> Which is just as awful. Just because it's done somewhere else
> doesn't make it right.
> 
> > > What we really need is a callout that returns the bdevs that the
> > > struct file is mapped to (one, or many), so the caller can then map
> > > the memory addresses to the block devices itself. The caller then
> > > needs to do an {file, offset, len} -> {bdev, sector, count}
> > > translation so the io_uring code can then use the correct bdev and
> > > dma mappings for the file offset that the user is doing IO to/from.
> > 
> > I don't even know if what you're proposing is possible.  Consider a
> > network filesystem which might transparently be moved from one network
> > interface to another.  I don't even know if the filesystem would know
> > which network device is going to be used for the IO at the time of
> > IO submission.
> 
> Sure, but nobody is suggesting we support direct DMA buffer mapping
> and reuse for network devices right now, whereas we have working
> code for block devices in front of us.

But we have working code already (merged) in the networking layer for
reusing pages that are mapped to particular devices.

> What I want to see is broad-based generic block device based
> filesystem support, not niche functionality that can only work on a
> single type of block device. Network filesystems and devices are a
> *long* way from being able to do anything like this, so I don't see
> a need to cater for them at this point in time.
> 
> When someone has a network device abstraction and network filesystem
> that can do direct data placement based on that device abstraction,
> then we can talk about the high level interface we should use to
> drive it....
> 
> > I think a totally different model is needed where we can find out if
> > the bvec contains pages which are already mapped to the device, and map
> > them if they aren't.  That also handles a DM case where extra devices
> > are hot-added to a RAID, for example.
> 
> I cannot form a picture of what you are suggesting from such a brief
> description. Care to explain in more detail?

Let's suppose you have a RAID 5 of NVMe devices.  One fails and now
the RAID-5 is operating in degraded mode.  So you hot-unplug the failed
device, plug in a new NVMe drive and add it to the RAID.  The pages now
need to be DMA mapped to that new PCI device.

What I'm saying is that the set of devices that the pages need to be
mapped to is not static and cannot be known at "setup time", even given
the additional information that you were proposing earlier in this thread.
It has to be dynamically adjusted.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  2:49         ` Matthew Wilcox
@ 2022-08-08  7:31           ` Dave Chinner
  2022-08-08 15:28             ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2022-08-08  7:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, hch, Alexander Viro, Kernel Team, Keith Busch

On Mon, Aug 08, 2022 at 03:49:09AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 08, 2022 at 12:15:01PM +1000, Dave Chinner wrote:
> > On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
> > > On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
> > > > > +#ifdef CONFIG_HAS_DMA
> > > > > +	void *(*dma_map)(struct file *, struct bio_vec *, int);
> > > > > +	void (*dma_unmap)(struct file *, void *);
> > > > > +#endif
> > > > 
> > > > This just smells wrong. Using a block layer specific construct as a
> > > > primary file operation parameter shouts "layering violation" to me.
> > > 
> > > A bio_vec is also used for networking; it's in disguise as an skb_frag,
> > > but it's there.
> > 
> > Which is just as awful. Just because it's done somewhere else
> > doesn't make it right.
> > 
> > > > What we really need is a callout that returns the bdevs that the
> > > > struct file is mapped to (one, or many), so the caller can then map
> > > > the memory addresses to the block devices itself. The caller then
> > > > needs to do an {file, offset, len} -> {bdev, sector, count}
> > > > translation so the io_uring code can then use the correct bdev and
> > > > dma mappings for the file offset that the user is doing IO to/from.
> > > 
> > > I don't even know if what you're proposing is possible.  Consider a
> > > network filesystem which might transparently be moved from one network
> > > interface to another.  I don't even know if the filesystem would know
> > > which network device is going to be used for the IO at the time of
> > > IO submission.
> > 
> > Sure, but nobody is suggesting we support direct DMA buffer mapping
> > and reuse for network devices right now, whereas we have working
> > code for block devices in front of us.
> 
> But we have working code already (merged) in the networking layer for
> reusing pages that are mapped to particular devices.

Great! How is it hooked up to the network filesystems? I'm kinda
betting that it isn't at all - it's the kernel bypass paths that use
these device based mappings, right? And the user applications are
bound directly to the devices, unlike network filesystems?

> > What I want to see is broad-based generic block device based
> > filesystem support, not niche functionality that can only work on a
> > single type of block device. Network filesystems and devices are a
> > *long* way from being able to do anything like this, so I don't see
> > a need to cater for them at this point in time.
> > 
> > When someone has a network device abstraction and network filesystem
> > that can do direct data placement based on that device abstraction,
> > then we can talk about the high level interface we should use to
> > drive it....
> > 
> > > I think a totally different model is needed where we can find out if
> > > the bvec contains pages which are already mapped to the device, and map
> > > them if they aren't.  That also handles a DM case where extra devices
> > > are hot-added to a RAID, for example.
> > 
> > I cannot form a picture of what you are suggesting from such a brief
> > description. Care to explain in more detail?
> 
> Let's suppose you have a RAID 5 of NVMe devices.  One fails and now
> the RAID-5 is operating in degraded mode.

Yes, but this is purely an example of a stacked device type that
requires fine-grained mapping of data offset to block device and
offset. When the device fails, it just doesn't return a data mapping
that points to the failed device.

> So you hot-unplug the failed
> device, plug in a new NVMe drive and add it to the RAID.  The pages now
> need to be DMA mapped to that new PCI device.

yup, and now the dma tags for the mappings to that sub-device return
errors, which then tell the application that it needs to remap the
dma buffers it is using.

That's just bog standard error handling - if a bdev goes away,
access to the dma tags has to return IO errors, and it is up to the
application level (i.e. the io_uring code) to handle that sanely.

> What I'm saying is that the set of devices that the pages need to be
> mapped to is not static and cannot be known at "setup time", even given
> the additional information that you were proposing earlier in this thread.
> It has to be dynamically adjusted.

Sure, I'm assuming that IOs based on dma tags will fail if there's a
bdev issue.  The DMA mappings have to be set up somewhere, and it
has to be done before the IO is started. That means there has to be
"discovery" done at "setup time", and if there's an issue between
setup and IO submission, then an error will result and the IO setup
code is going to have to handle that. I can't see how this would
work any other way....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  2:15       ` Dave Chinner
  2022-08-08  2:49         ` Matthew Wilcox
@ 2022-08-08 10:14         ` Pavel Begunkov
  1 sibling, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2022-08-08 10:14 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, hch, Alexander Viro, Kernel Team, Keith Busch

On 8/8/22 03:15, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 02:13:41AM +0100, Matthew Wilcox wrote:
>> On Mon, Aug 08, 2022 at 10:21:24AM +1000, Dave Chinner wrote:
>>>> +#ifdef CONFIG_HAS_DMA
>>>> +	void *(*dma_map)(struct file *, struct bio_vec *, int);
>>>> +	void (*dma_unmap)(struct file *, void *);
>>>> +#endif
>>>
>>> This just smells wrong. Using a block layer specific construct as a
>>> primary file operation parameter shouts "layering violation" to me.
>>
>> A bio_vec is also used for networking; it's in disguise as an skb_frag,
>> but it's there.
> 
> Which is just as awful. Just because it's done somewhere else
> doesn't make it right.
> 
>>> What we really need is a callout that returns the bdevs that the
>>> struct file is mapped to (one, or many), so the caller can then map
>>> the memory addresses to the block devices itself. The caller then
>>> needs to do an {file, offset, len} -> {bdev, sector, count}
>>> translation so the io_uring code can then use the correct bdev and
>>> dma mappings for the file offset that the user is doing IO to/from.
>>
>> I don't even know if what you're proposing is possible.  Consider a
>> network filesystem which might transparently be moved from one network
>> interface to another.  I don't even know if the filesystem would know
>> which network device is going to be used for the IO at the time of
>> IO submission.
> 
> Sure, but nobody is suggesting we support direct DMA buffer mapping
> and reuse for network devices right now, whereas we have working
> code for block devices in front of us.

Networking is not so far away: with zerocopy tx landed, the next target
is peer-to-peer, i.e. transfers from device memory. It's nothing
new and was already tried out quite some time ago but, to be fair,
it's not yet as ready as this patchset. In any case, both have to use
common infra, which means we can't rely on struct block_device.

The first idea was to have a callback returning a struct device
pointer and failing when the file can have multiple devices or change
them on the fly. Networking already has a hook to assign a device to
a socket; we just need to make it immutable after the assignment.
From the userspace perspective, if the host memory mapping fails it can
be re-registered as a normal io_uring registered buffer with no change
in the API on the submission side.
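
To sketch it (the callout name and everything below is made up, nothing
from this series), the hook could be as small as a new file operation:

#ifdef CONFIG_HAS_DMA
        /*
         * Hypothetical alternative to .dma_map/.dma_unmap: return the one
         * struct device that IO on this file DMAs to/from, or an ERR_PTR
         * when the file can span multiple devices or the device may change
         * after setup.
         */
        struct device *(*get_dma_device)(struct file *file);
#endif

io_uring would then do the dma_map_* calls itself and, when the callout
is absent or fails, silently fall back to a normal registered buffer.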

I like the idea of reserving ranges in the API for future use, but
as I understand it, io_uring would need to do device lookups based on
the I/O offset, which doesn't sound fast, and I'm not convinced we want
to go this way now. It could work if the specified range covers only one
device, but that needs knowledge of how the range is chunked and doesn't
work well when devices alternate every 4KB or so.

Another question is whether we want to have some kind of notion of
device groups so that userspace doesn't have to register a buffer
multiple times when the mapping can be shared between files.


> What I want to see is broad-based generic block device based
> filesystem support, not niche functionality that can only work on a
> single type of block device. Network filesystems and devices are a
> *long* way from being able to do anything like this, so I don't see
> a need to cater for them at this point in time.
> 
> When someone has a network device abstraction and network filesystem
> that can do direct data placement based on that device abstraction,
> then we can talk about the high level interface we should use to
> drive it....
> 
>> I think a totally different model is needed where we can find out if
>> the bvec contains pages which are already mapped to the device, and map
>> them if they aren't.  That also handles a DM case where extra devices
>> are hot-added to a RAID, for example.
> 
> I cannot form a picture of what you are suggesting from such a brief
> description. Care to explain in more detail?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 2/7] file: add ops to dma map bvec
  2022-08-08  7:31           ` Dave Chinner
@ 2022-08-08 15:28             ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-08 15:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Keith Busch, linux-nvme, linux-block, io-uring,
	linux-fsdevel, axboe, hch, Alexander Viro, Kernel Team

On Mon, Aug 08, 2022 at 05:31:34PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2022 at 03:49:09AM +0100, Matthew Wilcox wrote:
> 
> > So you hot-unplug the failed
> > device, plug in a new NVMe drive and add it to the RAID.  The pages now
> > need to be DMA mapped to that new PCI device.
> 
> yup, and now the dma tags for the mappings to that sub-device return
> errors, which then tell the application that it needs to remap the
> dma buffers it is using.
> 
> That's just bog standard error handling - if a bdev goes away,
> access to the dma tags has to return IO errors, and it is up to the
> application level (i.e. the io_uring code) to handle that sanely.

I didn't think anyone should see IO errors in such scenarios. This feature is
more of an optional optimization, and everything should work as it does today
if a tag becomes invalid.

For md raid or a multi-device filesystem, I imagined this would return a dma tag
that demuxes to the dma tags of the member devices. If any particular member device
doesn't have a dma tag for whatever reason, the filesystem or md would
transparently fall back to the registered bvec that it currently uses when it
needs to do IO to that device.

If you do a RAID hot-swap, MD could request a new dma tag for the new device
without io_uring knowing about the event. MD can continue servicing new IO
referencing its dma tag, and use the new device's tag only once the setup is
complete.
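
Roughly like this (everything here is hypothetical, just to show the
demux shape):

#include <linux/bvec.h>

/*
 * Hypothetical md-level dma tag: one slot per member device.  A NULL
 * slot (e.g. a freshly hot-added rdev that hasn't been mapped yet)
 * means IO to that member falls back to the registered bvecs.
 */
struct md_dma_tag {
        struct bio_vec *bvecs;          /* fallback path */
        int nr_members;
        void *member_tags[];            /* per-rdev driver dma tags */
};

static void *md_dma_tag_for_member(struct md_dma_tag *tag, int idx)
{
        if (idx < tag->nr_members)
                return tag->member_tags[idx];   /* may be NULL */
        return NULL;
}

io_uring only ever holds the top-level tag, so member changes stay
invisible to it.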

I'm not familiar enough with the networking side, but I thought the file level
abstraction would allow similar handling without io_uring's knowledge.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
                   ` (6 preceding siblings ...)
  2022-08-05 16:24 ` [PATCHv3 7/7] nvme-pci: implement dma_map support Keith Busch
@ 2022-08-09  6:46 ` Christoph Hellwig
  2022-08-09 14:18   ` Keith Busch
  2022-08-09 16:46   ` Keith Busch
  7 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-08-09  6:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, io-uring, linux-fsdevel, axboe, hch,
	Alexander Viro, Kernel Team, Keith Busch

I finally found some time to look over this, and I have some
very high-level problems with the implementation:

 - the DMA streaming ops (dma_map_*) are only intended for short term
   mappings, not for long living ones that can lead to starvation.
   Especially with swiotlb, but also with some IOMMUs.  I think this
   needs to be changed to an API that can allocate DMA mapped memory
   using dma_alloc_noncoherent/dma_alloc_noncontiguous for the device
   and then give access to that to the user instead (see the sketch
   after this list)
 - I really do not like the special casing in the bio.  Did you try
   to just stash away the DMA mapping information in an efficient
   lookup data structure (e.g. rhashtable, but details might need
   analysis and benchmarking) and thus touch very little code outside
   of the driver I/O path and the place that performs the mapping?
 - the design seems to ignore DMA ownership.  Every time data is
   transferred to and from the device, ownership needs to be transferred
   as well; take a look at Documentation/core-api/dma-api.rst and
   Documentation/core-api/dma-api-howto.rst.
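
A minimal sketch of what I mean by the first point (assuming the
allocation gets exposed to userspace via mmap; all driver glue omitted):

#include <linux/dma-mapping.h>

/*
 * Allocate device-mapped memory once, up front, instead of streaming-
 * mapping user pages per IO.  The returned sg_table already carries
 * the DMA addresses for this device.
 */
static struct sg_table *premap_alloc(struct device *dev, size_t size)
{
        return dma_alloc_noncontiguous(dev, size, DMA_BIDIRECTIONAL,
                                       GFP_KERNEL, 0);
}

/* Expose the same pages to userspace so the buffer can be reused. */
static int premap_mmap(struct device *dev, struct vm_area_struct *vma,
                       size_t size, struct sg_table *sgt)
{
        return dma_mmap_noncontiguous(dev, vma, size, sgt);
}

static void premap_free(struct device *dev, size_t size,
                        struct sg_table *sgt)
{
        dma_free_noncontiguous(dev, size, sgt, DMA_BIDIRECTIONAL);
}

On setups where every streaming mapping would otherwise bounce through
swiotlb, memory allocated this way is device-addressable from the start,
so nothing has to bounce per IO.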

As for the multiple devices discussion:  mapping memory for multiple
devices is possible, but nontrivial to get right, mostly due to the
ownership.  So unless we have a really good reason, I'd suggest not
supporting this initially.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-09  6:46 ` [PATCHv3 0/7] dma mapping optimisations Christoph Hellwig
@ 2022-08-09 14:18   ` Keith Busch
  2022-08-09 18:39     ` Christoph Hellwig
  2022-08-09 16:46   ` Keith Busch
  1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-08-09 14:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, Alexander Viro, Kernel Team

On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
>  - the design seems to ignore DMA ownership.  Every time data is
>    transferred to and from the device, ownership needs to be transferred
>    as well; take a look at Documentation/core-api/dma-api.rst and
>    Documentation/core-api/dma-api-howto.rst.

I have this doing appropriate dma_sync_single_for_{device,cpu} if we aren't
using coherent memory. Is there more to ownership beyond that?
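
For reference, that pattern is roughly this per IO (minimal sketch for a
read into a long-lived mapping):

#include <linux/dma-mapping.h>

static void premapped_read_sync(struct device *dev, dma_addr_t dma_addr,
                                size_t len)
{
        /* hand the buffer to the device before the transfer starts */
        dma_sync_single_for_device(dev, dma_addr, len, DMA_FROM_DEVICE);

        /* ... submit the read and wait for its completion ... */

        /* hand it back to the CPU before anyone reads the new data */
        dma_sync_single_for_cpu(dev, dma_addr, len, DMA_FROM_DEVICE);
}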

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-09  6:46 ` [PATCHv3 0/7] dma mapping optimisations Christoph Hellwig
  2022-08-09 14:18   ` Keith Busch
@ 2022-08-09 16:46   ` Keith Busch
  2022-08-09 18:41     ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-08-09 16:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, Alexander Viro, Kernel Team

On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
>  - the DMA streaming ops (dma_map_*) are only intended for short term
>    mappings, not for long living ones that can lead to starvation.
>    Especially with swiotlb, but also with some IOMMUs.  I think this
>    needs to be changed to an API that can allocate DMA mapped memory
>    using dma_alloc_noncoherent/dma_alloc_noncontiguous for the device
>    and then give access to that to the user instead

I am a bit reluctant to require a new memory allocator to use the feature. I'm
also generally not concerned about the odd resource constrained IOMMU. User
space drivers pre-map all their memory resources up front, and this is
essentially the same concept.

For swiotlb, though, we can error out the mapping if the requested memory uses
swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
is_swiotlb_buffer() is true. Would that be more acceptable?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-09 14:18   ` Keith Busch
@ 2022-08-09 18:39     ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2022-08-09 18:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	io-uring, linux-fsdevel, axboe, Alexander Viro, Kernel Team

On Tue, Aug 09, 2022 at 08:18:45AM -0600, Keith Busch wrote:
> On Tue, Aug 09, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> >  - the design seems to ignore DMA ownership.  Every time data is
> >    transferred to and from the device, ownership needs to be transferred
> >    as well; take a look at Documentation/core-api/dma-api.rst and
> >    Documentation/core-api/dma-api-howto.rst.
> 
> I have this doing appropriate dma_sync_single_for_{device,cpu} if we aren't
> using coherent memory. Is there more to ownership beyond that?

As long as we only ever support a single device that is fine, but
if we want to support more than that it gets complicated.  Sorry, this
should have been integrated with the mumblings on the multiple device
mappings, as the statement is false without that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-09 16:46   ` Keith Busch
@ 2022-08-09 18:41     ` Christoph Hellwig
  2022-08-10 18:05       ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-08-09 18:41 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	io-uring, linux-fsdevel, axboe, Alexander Viro, Kernel Team

On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> I am a bit reluctant to require a new memory allocator to use the feature. I'm
> also generally not concerned about the odd resource constrained IOMMU. User
> space drivers pre-map all their memory resources up front, and this is
> essentially the same concept.

Yes, it really isn't an issue for the modern iommus that support vfio,
but it is a limitation on the dma-api.  The old iommus aren't really the
issue, but..

> For swiotlb, though, we can error out the mapping if the requested memory uses
> swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> is_swiotlb_buffer() is true. Would that be more acceptable?

No, is_swiotlb_buffer and similar are not exported APIs.  More importantly
with the various secure hypervisor schemes, swiotlb use is unfortunately
increasing massively these days.  On those systems all streaming
mappings use swiotlb.  And the only way to get any kind of half-decent
I/O performance would be the "special" premapped allocator, which is
another reason why I'd like to see it.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-09 18:41     ` Christoph Hellwig
@ 2022-08-10 18:05       ` Keith Busch
  2022-08-11  7:22         ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2022-08-10 18:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, Alexander Viro, Kernel Team

On Tue, Aug 09, 2022 at 08:41:37PM +0200, Christoph Hellwig wrote:
> On Tue, Aug 09, 2022 at 10:46:04AM -0600, Keith Busch wrote:
> 
> > For swiotlb, though, we can error out the mapping if the requested memory uses
> > swiotlb with the device: the driver's .dma_map() can return ENOTSUPP if
> > is_swiotlb_buffer() is true. Would that be more acceptable?
> 
> No, is_swiotlb_buffer and similar are not exported APIs.

The functions are implemented under 'include/linux/', indistinguishable from
exported APIs. I think I understand why they are there, but they look the same
as exported functions from a driver perspective.

> More importantly with the various secure hypervisor schemes swiotlb is
> unfortunately actually massively increasing these days.  On those systems all
> streaming mappings use swiotlb.  And the only way to get any kind of
> half-decent I/O performance would be the "special" premapped allocator, which
> is another reason why I'd like to see it.

Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
leverages this feature. If you're using that, you've traded performance for
security or compatibility already. If this idea can be used to make it perform
better, then great, but that shouldn't be the reason to hold this up IMO.

This optimization needs to be easy to reach if we expect anyone to use it.
Working with arbitrary user addresses with minimal additions to the user ABI
was deliberate. If you want a special allocator, we can always add one later;
this series doesn't affect that.

If this has the potential to starve system resources, though, I can constrain it to
specific users like CAP_SYS_ADMIN, or maybe only memory allocated from
hugetlbfs. Or perhaps a more complicated scheme of shuffling dma mapping
resources on demand if that is an improvement over the status quo.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-10 18:05       ` Keith Busch
@ 2022-08-11  7:22         ` Christoph Hellwig
  2022-08-31 21:19           ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2022-08-11  7:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Keith Busch, linux-nvme, linux-block,
	io-uring, linux-fsdevel, axboe, Alexander Viro, Kernel Team

On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> The functions are implemented under 'include/linux/', indistinguishable from
> exported APIs. I think I understand why they are there, but they look the same
> as exported functions from a driver perspective.

swiotlb.h is not a driver API.  There's two leftovers used by the drm
code I'm trying to get fixed up, but in general the DMA API is the
interface and swiotlb is just an implementation detail.

> Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> leverages this feature. If you're using that, you've traded performance for
> security or compatibility already. If this idea can be used to make it perform
> better, then great, but that shouldn't be the reason to hold this up IMO.

We firstly need to make sure that everything actually works on swiotlb, or
any other implementation that properly implements the DMA API.

And the fact that I/O performance currently sucks and we can fix it on
the trusted hypervisor is an important consideration.  At least as
important as micro-optimizing performance a little more on setups
not using them.  So not taking care of both in one go seems rather silly
for a feature that is in its current form pretty intrusive and thus needs
a really good justification.

> This optimization needs to be easy to reach if we expect anyone to use it.
> Working with arbitrary user addresses with minimal additions to the user ABI
> was deliberate. If you want a special allocator, we can always add one later;
> this series doesn't affect that.
> 
> If this has the potential to starve system resources, though, I can constrain it to
> specific users like CAP_SYS_ADMIN, or maybe only memory allocated from
> hugetlbfs. Or perhaps a more complicated scheme of shuffling dma mapping
> resources on demand if that is an improvement over the status quo.

Or just not bother with it at all.  Because with all those limits it
really does not seem to be worth adding an entirely new type of bio
payload to the block layer and a lot of boilerplate to drivers.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCHv3 0/7] dma mapping optimisations
  2022-08-11  7:22         ` Christoph Hellwig
@ 2022-08-31 21:19           ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2022-08-31 21:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, linux-block, io-uring, linux-fsdevel,
	axboe, Alexander Viro, Kernel Team

On Thu, Aug 11, 2022 at 09:22:32AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> > The functions are implemented under 'include/linux/', indistinguishable from
> > exported APIs. I think I understand why they are there, but they look the same
> > as exported functions from a driver perspective.
> 
> swiotlb.h is not a driver API.  There's two leftovers used by the drm
> code I'm trying to get fixed up, but in general the DMA API is the
> interface and swiotlb is just an implementation detail.
> 
> > Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> > leverages this feature. If you're using that, you've traded performance for
> > security or compatibility already. If this idea can be used to make it perform
> > better, then great, but that shouldn't be the reason to hold this up IMO.
> 
> We firstly need to make sure that everything actually works on swiotlb, or
> any other implementation that properly implements the DMA API.
> 
> And the fact that I/O performance currently sucks and we can fix it on
> the trusted hypervisor is an important consideration.  At least as
> important as micro-optimizing performance a little more on setups
> not using them.  So not taking care of both in one go seems rather silly
> for a feature that is in its current form pretty intrusive and thus needs
> a really good justification.

Sorry for the delayed response; I had some trouble with the test setup.

Okay, I will restart developing this with swiotlb in mind.

In the meantime, I wanted to share some results with this series because I'm
thinking this might be past the threshold for when we can drop the "micro-"
prefix on optimisations.

The most significant data points are these:

  * submission latency stays the same regardless of the transfer size or depth
  * IOPs are always equal or better (usually better), with up to 50% reduced
    CPU cost

Based on this, I do think this type of optimisation is worth having
something like a new bio type. I know this introduces some complications
in the IO path, but they are pretty minimal and don't add any size
penalties to common structs for drivers that don't use them.

Test details:

  fio with ioengine=io_uring
    'none': using __user void*
    'bvec': using buf registered with IORING_REGISTER_BUFFERS
    'dma': using buf registered with IORING_REGISTER_MAP_BUFFERS (new)

  intel_iommu=on

Results:

(submission latency [slat] in nano-seconds)
Q-Depth 1:

 Size |  Premap  |   IOPs  |  slat  | sys-cpu%
 .....|..........|.........|........|.........
 4k   |    none  |  41.4k  |  2126  |   16.47%
      |    bvec  |  43.8k  |  1843  |   15.79%
      |     dma  |  46.8k  |  1504  |   14.94%
 16k  |    none  |  33.3k  |  3279  |   17.78%
      |    bvec  |  33.9k  |  2607  |   14.59%
      |     dma  |  40.2k  |  1490  |   12.57%
 64k  |    none  |  18.7k  |  6778  |   18.22%
      |    bvec  |  20.0k  |  4626  |   13.80%
      |     dma  |  22.6k  |  1586  |    7.58%

Q-Depth 16:

 Size |  Premap  |   IOPs  |  slat  | sys-cpu%
 .....|..........|.........|........|.........
 4k   |    none  |   207k  |  3657  |   72.81%
      |    bvec  |   219k  |  3369  |   71.55%
      |     dma  |   310k  |  2237  |   60.16%
 16k  |    none  |   164k  |  5024  |   78.38%
      |    bvec  |   177k  |  4553  |   76.29%
      |     dma  |   186k  |  1880  |   43.56%
 64k  |    none  |  46.7k  |  4424  |   30.51%
      |    bvec  |  46.7k  |  4389  |   29.42%
      |     dma  |  46.7k  |  1574  |   15.61%


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2022-08-31 21:29 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-05 16:24 [PATCHv3 0/7] dma mapping optimisations Keith Busch
2022-08-05 16:24 ` [PATCHv3 1/7] blk-mq: add ops to dma map bvec Keith Busch
2022-08-05 16:24 ` [PATCHv3 2/7] file: " Keith Busch
2022-08-08  0:21   ` Dave Chinner
2022-08-08  1:13     ` Matthew Wilcox
2022-08-08  2:15       ` Dave Chinner
2022-08-08  2:49         ` Matthew Wilcox
2022-08-08  7:31           ` Dave Chinner
2022-08-08 15:28             ` Keith Busch
2022-08-08 10:14         ` Pavel Begunkov
2022-08-05 16:24 ` [PATCHv3 3/7] iov_iter: introduce type for preregistered dma tags Keith Busch
2022-08-05 16:24 ` [PATCHv3 4/7] block: add dma tag bio type Keith Busch
2022-08-05 16:24 ` [PATCHv3 5/7] io_uring: introduce file slot release helper Keith Busch
2022-08-05 16:24 ` [PATCHv3 6/7] io_uring: add support for dma pre-mapping Keith Busch
2022-08-05 16:24 ` [PATCHv3 7/7] nvme-pci: implement dma_map support Keith Busch
2022-08-09  6:46 ` [PATCHv3 0/7] dma mapping optimisations Christoph Hellwig
2022-08-09 14:18   ` Keith Busch
2022-08-09 18:39     ` Christoph Hellwig
2022-08-09 16:46   ` Keith Busch
2022-08-09 18:41     ` Christoph Hellwig
2022-08-10 18:05       ` Keith Busch
2022-08-11  7:22         ` Christoph Hellwig
2022-08-31 21:19           ` Keith Busch
