linux-block.vger.kernel.org archive mirror
* [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
@ 2023-02-10 15:32 Ming Lei
  2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-10 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Ming Lei

Hello,

Add two OPs whose buffer is retrieved via kernel splice, for supporting
fuse/ublk zero copy.

The 1st patch enhances the direct pipe & splice for moving pages in the
kernel, so that the two added OPs won't be misused, avoiding a potential
security hole.

The 2nd patch allows the splice_direct_to_actor() caller to ignore signals
if the actor won't block and can complete in bounded time.

The 3rd patch adds the two OPs.

The 4th patch implements ublk's ->splice_read() for supporting
zero copy.

ublksrv(userspace):

https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
    
So far, only the loop/null targets implement zero copy in the above branch:
    
	ublk add -t loop -f $file -z
	ublk add -t none -z

Basic FS/IO functionality is verified: mount, kernel building & fio all
work fine, and big-chunk IO (BS: 64k/512k) performance is improved
noticeably.
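
For reference, a rough sketch of how a backend might issue one of the new
OPs (illustrative only: liburing helpers are used for brevity; ring,
backing_fd, ublk_char_fd, q_id, tag, io_bytes and backing_off are assumed
to come from the usual ublk IO command flow; the splice_fd_in/splice_off_in
usage is from patch 3 and ublk_pos() is added in patch 4):

	/* serve a ublk READ request: read from the backing file directly
	 * into the request pages provided by ublk's ->splice_read() */
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_rw(IORING_OP_READ_SPLICE_BUF, sqe, backing_fd,
			 NULL, io_bytes, backing_off);
	sqe->splice_fd_in  = ublk_char_fd;            /* buffer provider */
	sqe->splice_off_in = ublk_pos(q_id, tag, 0);  /* which request buffer */
	io_uring_submit(&ring);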
 
Any comment is welcome!

Ming Lei (4):
  fs/splice: enhance direct pipe & splice for moving pages in kernel
  fs/splice: allow to ignore signal in __splice_from_pipe
  io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  ublk_drv: support splice based read/write zero copy

 drivers/block/ublk_drv.c      | 169 +++++++++++++++++++++++++++++++--
 fs/splice.c                   |  19 +++-
 include/linux/pipe_fs_i.h     |  10 ++
 include/linux/splice.h        |  23 +++++
 include/uapi/linux/io_uring.h |   2 +
 include/uapi/linux/ublk_cmd.h |  31 +++++-
 io_uring/opdef.c              |  37 ++++++++
 io_uring/rw.c                 | 174 +++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 |   1 +
 9 files changed, 456 insertions(+), 10 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
@ 2023-02-10 15:32 ` Ming Lei
  2023-02-11 15:42   ` Ming Lei
  2023-02-10 15:32 ` [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-10 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Ming Lei

The per-task direct pipe can transfer pages between two files, or between
one file and other kernel components, especially via splice_direct_to_actor()
and __splice_from_pipe().

This is helpful for fuse/ublk to implement zero copy by transferring pages
from the device to a file or socket. However, when the device's
->splice_read() produces pages, the kernel consumer may read from or write
to these pages, and from the device's viewpoint there could be unexpected
reads or writes on those pages.

Enforce this restriction with the following approach:

1) add kernel splice flags SPLICE_F_KERN_FOR_[READ|WRITE], which are
   passed to the device's ->splice_read(), so the device can check whether
   this READ or WRITE is expected on the pages filled into the pipe,
   together with the information from ppos & len

2) add the kernel splice flag SPLICE_F_KERN_NEED_CONFIRM, which is passed
   to the device's ->splice_read() to ask the device to confirm that it
   really supports this kind of usage of feeding pages via ->splice_read().
   If the device does support it, pipe->ack_page_consuming is set. This
   avoids misuse (a provider-side sketch follows below).
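
For illustration only (not part of this patch), a minimal sketch of how a
hypothetical provider's ->splice_read() might honor this protocol; the
mydev_* name is made up, and the real ublk implementation is in patch 4:

	static ssize_t mydev_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe,
					 size_t len, unsigned int flags)
	{
		ssize_t done = 0;

		/*
		 * only serve the per-task direct pipe, and only when the
		 * consumer asks for confirmation
		 */
		if (pipe != current->splice_pipe ||
		    !(flags & SPLICE_F_KERN_NEED_CONFIRM))
			return -EACCES;

		/*
		 * require the consumer to declare its READ/WRITE intent; a
		 * real driver would also check it against the specific
		 * request looked up via *ppos
		 */
		if (!(flags & (SPLICE_F_KERN_FOR_READ | SPLICE_F_KERN_FOR_WRITE)))
			return -EACCES;

		/*
		 * ... look up the request via *ppos and add_to_pipe() its
		 * pages, accumulating the byte count in 'done' ...
		 */

		/* confirm that consuming the filled pages is allowed */
		if (done)
			pipe_ack_page_consume(pipe);

		return done;
	}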

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/splice.c               | 15 +++++++++++++++
 include/linux/pipe_fs_i.h | 10 ++++++++++
 include/linux/splice.h    | 22 ++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 87d9b19349de..c4770e1644cc 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -792,6 +792,14 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
 
+static inline bool slice_read_acked(const struct pipe_inode_info *pipe,
+			       int flags)
+{
+	if (flags & SPLICE_F_KERN_NEED_CONFIRM)
+		return pipe->ack_page_consuming;
+	return true;
+}
+
 /**
  * splice_direct_to_actor - splices data directly between two non-pipes
  * @in:		file to splice from
@@ -861,10 +869,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
 
+		pipe->ack_page_consuming = false;
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))
 			goto out_release;
 
+		if (!slice_read_acked(pipe, flags)) {
+			bytes = 0;
+			ret = -EACCES;
+			goto out_release;
+		}
+
 		read_len = ret;
 		sd->total_len = read_len;
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 6cb65df3e3ba..09ee1a9380ec 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,6 +72,7 @@ struct pipe_inode_info {
 	unsigned int r_counter;
 	unsigned int w_counter;
 	bool poll_usage;
+	bool ack_page_consuming;	/* only for direct pipe */
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
@@ -218,6 +219,15 @@ static inline void pipe_discard_from(struct pipe_inode_info *pipe,
 		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
 }
 
+/*
+ * Called in ->splice_read() for confirming the READ/WRITE page is allowed
+ */
+static inline void pipe_ack_page_consume(struct pipe_inode_info *pipe)
+{
+	if (!WARN_ON_ONCE(current->splice_pipe != pipe))
+		pipe->ack_page_consuming = true;
+}
+
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
 #define PIPE_SIZE		PAGE_SIZE
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..98c471fd918d 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,28 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Flags used for kernel internal page move from ->splice_read()
+ * by internal direct pipe, and user pipe can't touch these
+ * flags.
+ *
+ * Pages filled from ->splice_read() are usually moved/copied to
+ * ->splice_write(). Here address fuse/ublk zero copy by transferring
+ * page from device to file/socket for either READ or WRITE. So we
+ * need ->splice_read() to confirm if this READ/WRITE is allowed on
+ * pages filled in ->splice_read().
+ */
+/* The page consumer is for READ from pages moved from direct pipe */
+#define SPLICE_F_KERN_FOR_READ	(0x100)
+/* The page consumer is for WRITE to pages moved from direct pipe */
+#define SPLICE_F_KERN_FOR_WRITE	(0x200)
+/*
+ * ->splice_read() has to confirm if consumer's READ/WRITE on pages
+ * is allowed. If yes, ->splice_read() has to set pipe->ack_page_consuming,
+ * otherwise pipe->ack_page_consuming has to be cleared.
+ */
+#define SPLICE_F_KERN_NEED_CONFIRM	(0x400)
+
 /*
  * Passed to the actors
  */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
  2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
@ 2023-02-10 15:32 ` Ming Lei
  2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-10 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Ming Lei

__splice_from_pipe() is used to splice data from a pipe, and the actor
could simply be grabbing page references, so if the caller can guarantee
that the actor won't block, it isn't necessary to return -ERESTARTSYS.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 fs/splice.c            | 4 ++--
 include/linux/splice.h | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index c4770e1644cc..a8dc46db1045 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -471,7 +471,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des
 	 * Check for signal early to make process killable when there are
 	 * always buffers available
 	 */
-	if (signal_pending(current))
+	if (signal_pending(current) && !sd->ignore_sig)
 		return -ERESTARTSYS;
 
 repeat:
@@ -485,7 +485,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_des
 		if (sd->flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
 
-		if (signal_pending(current))
+		if (signal_pending(current) && !sd->ignore_sig)
 			return -ERESTARTSYS;
 
 		if (sd->need_wakeup) {
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 98c471fd918d..89e0a0f8b471 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -64,6 +64,7 @@ struct splice_desc {
 	loff_t *opos;			/* sendfile: output position */
 	size_t num_spliced;		/* number of bytes already spliced */
 	bool need_wakeup;		/* need to wake up writer */
+	bool ignore_sig;
 };
 
 struct partial_page {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
  2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
  2023-02-10 15:32 ` [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe Ming Lei
@ 2023-02-10 15:32 ` Ming Lei
  2023-02-11 15:45   ` Jens Axboe
  2023-02-11 17:13   ` Jens Axboe
  2023-02-10 15:32 ` [PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-10 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Ming Lei

IORING_OP_READ_SPLICE_BUF: read into a buffer which is built from
->splice_read() of the specified fd, so the user needs to provide
(splice_fd, offset, len) for building the buffer.

IORING_OP_WRITE_SPLICE_BUF: write from a buffer which is built from
->splice_read() of the specified fd, so the user needs to provide
(splice_fd, offset, len) for building the buffer.

The typical use case is supporting ublk/fuse io_uring zero copy: the
READ/WRITE OP retrieves the ublk/fuse request buffer via the direct pipe
from the device's ->splice_read(), then the READ/WRITE can be done
to/from this buffer directly.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/uapi/linux/io_uring.h |   2 +
 io_uring/opdef.c              |  37 ++++++++
 io_uring/rw.c                 | 174 +++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 |   1 +
 4 files changed, 213 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 636a4c2c1294..bada0c91a350 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,8 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_READ_SPLICE_BUF,
+	IORING_OP_WRITE_SPLICE_BUF,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 5238ecd7af6a..91e8d8f96134 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_READ_SPLICE_BUF] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw,
+		.issue			= io_read,
+	},
+	[IORING_OP_WRITE_SPLICE_BUF] = {
+		.needs_file		= 1,
+		.hash_reg_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw,
+		.issue			= io_write,
+	},
 };
 
 
@@ -647,6 +672,18 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_READ_SPLICE_BUF] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "READ_TO_SPLICE_BUF",
+		.cleanup		= io_read_write_cleanup,
+		.fail			= io_rw_fail,
+	},
+	[IORING_OP_WRITE_SPLICE_BUF] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "WRITE_FROM_SPLICE_BUF",
+		.cleanup		= io_read_write_cleanup,
+		.fail			= io_rw_fail,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index efe6bfda9ca9..381514fd1bc5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -73,6 +73,175 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 	return 0;
 }
 
+struct io_rw_splice_buf_data {
+	unsigned long total;
+	unsigned int  max_bvecs;
+	struct io_mapped_ubuf **imu;
+};
+
+/* the max size of whole 'io_mapped_ubuf' allocation is one page */
+static inline unsigned int io_rw_max_splice_buf_bvecs(void)
+{
+	return (PAGE_SIZE - sizeof(struct io_mapped_ubuf)) /
+			sizeof(struct bio_vec);
+}
+
+static inline unsigned int io_rw_splice_buf_nr_bvecs(unsigned long len)
+{
+	return min_t(unsigned int, (len + PAGE_SIZE - 1) >> PAGE_SHIFT,
+			io_rw_max_splice_buf_bvecs());
+}
+
+static inline bool io_rw_splice_buf(struct io_kiocb *req)
+{
+	return req->opcode == IORING_OP_READ_SPLICE_BUF ||
+		req->opcode == IORING_OP_WRITE_SPLICE_BUF;
+}
+
+static void io_rw_cleanup_splice_buf(struct io_kiocb *req)
+{
+	struct io_mapped_ubuf *imu = req->imu;
+	int i;
+
+	if (!imu)
+		return;
+
+	for (i = 0; i < imu->nr_bvecs; i++)
+		put_page(imu->bvec[i].bv_page);
+
+	req->imu = NULL;
+	kfree(imu);
+}
+
+static int io_splice_buf_actor(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf,
+			       struct splice_desc *sd)
+{
+	struct io_rw_splice_buf_data *data = sd->u.data;
+	struct io_mapped_ubuf *imu = *data->imu;
+	struct bio_vec *bvec;
+
+	if (imu->nr_bvecs >= data->max_bvecs) {
+		/*
+		 * Double bvec allocation given we don't know
+		 * how many remain
+		 */
+		unsigned nr_bvecs = min(data->max_bvecs * 2,
+				io_rw_max_splice_buf_bvecs());
+		struct io_mapped_ubuf *new_imu;
+
+		/* can't grow, give up */
+		if (nr_bvecs <= data->max_bvecs)
+			return 0;
+
+		new_imu = krealloc(imu, struct_size(imu, bvec, nr_bvecs),
+				GFP_KERNEL);
+		if (!new_imu)
+			return -ENOMEM;
+		imu = new_imu;
+		data->max_bvecs = nr_bvecs;
+		*data->imu = imu;
+	}
+
+	if (!try_get_page(buf->page))
+		return -EINVAL;
+
+	bvec = &imu->bvec[imu->nr_bvecs];
+	bvec->bv_page = buf->page;
+	bvec->bv_offset = buf->offset;
+	bvec->bv_len = buf->len;
+	imu->nr_bvecs++;
+	data->total += buf->len;
+
+	return buf->len;
+}
+
+static int io_splice_buf_direct_actor(struct pipe_inode_info *pipe,
+			       struct splice_desc *sd)
+{
+	return __splice_from_pipe(pipe, sd, io_splice_buf_actor);
+}
+
+static int __io_prep_rw_splice_buf(struct io_kiocb *req,
+				   struct io_rw_splice_buf_data *data,
+				   struct file *splice_f,
+				   size_t len,
+				   loff_t splice_off)
+{
+	unsigned flags = req->opcode == IORING_OP_READ_SPLICE_BUF ?
+			SPLICE_F_KERN_FOR_READ : SPLICE_F_KERN_FOR_WRITE;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags | SPLICE_F_NONBLOCK | SPLICE_F_KERN_NEED_CONFIRM,
+		.pos = splice_off,
+		.u.data = data,
+		.ignore_sig = true,
+	};
+
+	return splice_direct_to_actor(splice_f, &sd,
+			io_splice_buf_direct_actor);
+}
+
+static int io_prep_rw_splice_buf(struct io_kiocb *req,
+				 const struct io_uring_sqe *sqe)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
+	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
+	struct io_rw_splice_buf_data data;
+	struct io_mapped_ubuf *imu;
+	struct fd splice_fd;
+	int ret;
+
+	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
+	if (!splice_fd.file)
+		return -EBADF;
+
+	ret = -EBADF;
+	if (!(splice_fd.file->f_mode & FMODE_READ))
+		goto out_put_fd;
+
+	ret = -ENOMEM;
+	imu = kmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	if (!imu)
+		goto out_put_fd;
+
+	/* the splice buffer actually has no virtual address */
+	imu->nr_bvecs = 0;
+
+	data.max_bvecs = nr_pages;
+	data.total = 0;
+	data.imu = &imu;
+
+	rw->addr = 0;
+	req->flags |= REQ_F_NEED_CLEANUP;
+
+	ret = __io_prep_rw_splice_buf(req, &data, splice_fd.file, rw->len,
+			splice_off);
+	imu = *data.imu;
+	imu->acct_pages = 0;
+	imu->ubuf = 0;
+	imu->ubuf_end = data.total;
+	rw->len = data.total;
+	req->imu = imu;
+	if (!data.total) {
+		io_rw_cleanup_splice_buf(req);
+	} else  {
+		ret = 0;
+	}
+out_put_fd:
+	if (splice_fd.file)
+		fdput(splice_fd);
+
+	return ret;
+}
+
+void io_read_write_cleanup(struct io_kiocb *req)
+{
+	if (io_rw_splice_buf(req))
+		io_rw_cleanup_splice_buf(req);
+}
+
 int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -117,6 +286,8 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		ret = io_iov_buffer_select_prep(req);
 		if (ret)
 			return ret;
+	} else if (io_rw_splice_buf(req)) {
+		return io_prep_rw_splice_buf(req, sqe);
 	}
 
 	return 0;
@@ -371,7 +542,8 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	size_t sqe_len;
 	ssize_t ret;
 
-	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
+	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED ||
+			io_rw_splice_buf(req)) {
 		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
 		if (ret)
 			return ERR_PTR(ret);
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3b733f4b610a..b37d6f6ecb6a 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -21,4 +21,5 @@ int io_readv_prep_async(struct io_kiocb *req);
 int io_write(struct io_kiocb *req, unsigned int issue_flags);
 int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
+void io_read_write_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/4] ublk_drv: support splice based read/write zero copy
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
                   ` (2 preceding siblings ...)
  2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
@ 2023-02-10 15:32 ` Ming Lei
  2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
  2023-02-14 16:36 ` Stefan Hajnoczi
  5 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-10 15:32 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Ming Lei

The initial idea of using splice for zero copy is from Miklos and Stefan.

Now io_uring supports IORING_OP_READ[WRITE]_SPLICE_BUF, and ublk can
pass request pages via direct/kernel private pipe to backend io_uring
read/write handling code.

This zero copy implementation improves sequential IO performance
noticeably.

ublksrv code:

https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf

So far, only the loop/null targets support zero copy:

	ublk add -t loop -f $file -z
	ublk add -t none -z

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 169 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |  31 ++++++-
 2 files changed, 193 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e6eceee44366..5ef5f2ccb0d5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,8 @@
 #include <asm/page.h>
 #include <linux/task_work.h>
 #include <linux/namei.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <uapi/linux/ublk_cmd.h>
 
 #define UBLK_MINORS		(1U << MINORBITS)
@@ -154,6 +156,8 @@ struct ublk_device {
 	unsigned long		state;
 	int			ub_number;
 
+	struct srcu_struct	srcu;
+
 	struct mutex		mutex;
 
 	spinlock_t		mm_lock;
@@ -537,6 +541,9 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
 		return rq_bytes;
 
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY)
+		return rq_bytes;
+
 	if (ublk_rq_has_data(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
@@ -558,6 +565,9 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY)
+		return rq_bytes;
+
 	if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
@@ -1221,6 +1231,7 @@ static void ublk_stop_dev(struct ublk_device *ub)
 	del_gendisk(ub->ub_disk);
 	ub->dev_info.state = UBLK_S_DEV_DEAD;
 	ub->dev_info.ublksrv_pid = -1;
+	synchronize_srcu(&ub->srcu);
 	put_disk(ub->ub_disk);
 	ub->ub_disk = NULL;
  unlock:
@@ -1355,13 +1366,155 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return -EIOCBQUEUED;
 }
 
+static void ublk_pipe_buf_release(struct pipe_inode_info *pipe,
+			      struct pipe_buffer *buf)
+{
+}
+
+static const struct pipe_buf_operations ublk_pipe_buf_ops = {
+        .release        = ublk_pipe_buf_release,
+};
+
+static inline bool ublk_check_splice_rw(const struct request *req,
+		unsigned int flags)
+{
+	flags &= (SPLICE_F_KERN_FOR_READ | SPLICE_F_KERN_FOR_WRITE);
+
+	if (req_op(req) == REQ_OP_READ && flags == SPLICE_F_KERN_FOR_READ)
+		return true;
+
+	if (req_op(req) == REQ_OP_WRITE && flags == SPLICE_F_KERN_FOR_WRITE)
+		return true;
+
+	return false;
+}
+
+static ssize_t ublk_splice_read(struct file *in, loff_t *ppos,
+		struct pipe_inode_info *pipe,
+		size_t len, unsigned int flags)
+{
+	struct ublk_device *ub = in->private_data;
+	struct req_iterator rq_iter;
+	struct bio_vec bv;
+	struct request *req;
+	struct ublk_queue *ubq;
+	u16 tag, q_id;
+	unsigned int done;
+	int ret, buf_offset, srcu_idx;
+
+	if (!ub)
+		return -EPERM;
+
+	/* only support direct pipe and we do need confirm */
+	if (pipe != current->splice_pipe || !(flags &
+				SPLICE_F_KERN_NEED_CONFIRM))
+		return -EACCES;
+
+	ret = -EINVAL;
+
+	/* protect request queue & disk removed */
+	srcu_idx = srcu_read_lock(&ub->srcu);
+
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+		goto exit;
+
+	tag = ublk_pos_to_tag(*ppos);
+	q_id = ublk_pos_to_hwq(*ppos);
+	buf_offset = ublk_pos_to_buf_offset(*ppos);
+
+	if (q_id >= ub->dev_info.nr_hw_queues)
+		goto exit;
+
+	ubq = ublk_get_queue(ub, q_id);
+	if (!ubq)
+		goto exit;
+
+	if (!(ubq->flags & UBLK_F_SUPPORT_ZERO_COPY))
+		goto exit;
+
+	/*
+	 * So far just support splice read buffer from ubq daemon context
+	 * because request may be gone in ->splice_read() if the splice
+	 * is called from other context.
+	 *
+	 * TODO: add request protection and relax the following limit.
+	 */
+	if (ubq->ubq_daemon != current)
+		goto exit;
+
+	if (tag >= ubq->q_depth)
+		goto exit;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag);
+	if (!req || !blk_mq_request_started(req))
+		goto exit;
+
+	pr_devel("%s: qid %d tag %u offset %x, request bytes %u, len %llu flags %x\n",
+			__func__, q_id, tag, buf_offset, blk_rq_bytes(req),
+			(unsigned long long)len, flags);
+
+	if (!ublk_check_splice_rw(req, flags))
+		goto exit;
+
+	if (!ublk_rq_has_data(req) || !len)
+		goto exit;
+
+	if (buf_offset + len > blk_rq_bytes(req))
+		goto exit;
+
+	done = ret = 0;
+	rq_for_each_bvec(bv, req, rq_iter) {
+		struct pipe_buffer buf = {
+			.ops = &ublk_pipe_buf_ops,
+			.flags = 0,
+			.page = bv.bv_page,
+			.offset = bv.bv_offset,
+			.len = bv.bv_len,
+		};
+
+		if (buf_offset > 0) {
+			if (buf_offset >= bv.bv_len) {
+				buf_offset -= bv.bv_len;
+				continue;
+			} else {
+				buf.offset += buf_offset;
+				buf.len -= buf_offset;
+				buf_offset = 0;
+			}
+		}
+
+		if (done + buf.len > len)
+			buf.len = len - done;
+		done += buf.len;
+
+		ret = add_to_pipe(pipe, &buf);
+		if (unlikely(ret < 0)) {
+			done -= buf.len;
+			break;
+		}
+		if (done >= len)
+			break;
+	}
+
+	if (done) {
+		*ppos += done;
+		ret = done;
+
+		pipe_ack_page_consume(pipe);
+	}
+exit:
+	srcu_read_unlock(&ub->srcu, srcu_idx);
+	return ret;
+}
+
 static const struct file_operations ublk_ch_fops = {
 	.owner = THIS_MODULE,
 	.open = ublk_ch_open,
 	.release = ublk_ch_release,
-	.llseek = no_llseek,
+	.llseek = noop_llseek,
 	.uring_cmd = ublk_ch_uring_cmd,
 	.mmap = ublk_ch_mmap,
+	.splice_read = ublk_splice_read,
 };
 
 static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
@@ -1472,6 +1625,7 @@ static void ublk_cdev_rel(struct device *dev)
 	ublk_deinit_queues(ub);
 	ublk_free_dev_number(ub);
 	mutex_destroy(&ub->mutex);
+	cleanup_srcu_struct(&ub->srcu);
 	kfree(ub);
 }
 
@@ -1600,17 +1754,18 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 		set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
 
 	get_device(&ub->cdev_dev);
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	ret = add_disk(disk);
 	if (ret) {
 		/*
 		 * Has to drop the reference since ->free_disk won't be
 		 * called in case of add_disk failure.
 		 */
+		ub->dev_info.state = UBLK_S_DEV_DEAD;
 		ublk_put_device(ub);
 		goto out_put_disk;
 	}
 	set_bit(UB_STATE_USED, &ub->state);
-	ub->dev_info.state = UBLK_S_DEV_LIVE;
 out_put_disk:
 	if (ret)
 		put_disk(disk);
@@ -1718,6 +1873,9 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	ub = kzalloc(sizeof(*ub), GFP_KERNEL);
 	if (!ub)
 		goto out_unlock;
+	ret = init_srcu_struct(&ub->srcu);
+	if (ret)
+		goto out_free_ub;
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->mm_lock);
 	INIT_WORK(&ub->quiesce_work, ublk_quiesce_work_fn);
@@ -1726,7 +1884,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 
 	ret = ublk_alloc_dev_number(ub, header->dev_id);
 	if (ret < 0)
-		goto out_free_ub;
+		goto out_clean_srcu;
 
 	memcpy(&ub->dev_info, &info, sizeof(info));
 
@@ -1744,9 +1902,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
 		ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
@@ -1776,6 +1931,8 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	ublk_deinit_queues(ub);
 out_free_dev_number:
 	ublk_free_dev_number(ub);
+out_clean_srcu:
+	cleanup_srcu_struct(&ub->srcu);
 out_free_ub:
 	mutex_destroy(&ub->mutex);
 	kfree(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index f6238ccc7800..a2f6748ee4ca 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -54,7 +54,36 @@
 #define UBLKSRV_IO_BUF_OFFSET	0x80000000
 
 /* tag bit is 12bit, so at most 4096 IOs for each queue */
-#define UBLK_MAX_QUEUE_DEPTH	4096
+#define UBLK_TAG_BITS		12
+#define UBLK_MAX_QUEUE_DEPTH	(1U << UBLK_TAG_BITS)
+
+/* used in ->splice_read for supporting zero-copy */
+#define UBLK_BUFS_SIZE_BITS	42
+#define UBLK_BUFS_SIZE_MASK    ((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
+#define UBLK_BUF_SIZE_BITS     (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
+#define UBLK_BUF_MAX_SIZE      (1ULL << UBLK_BUF_SIZE_BITS)
+
+static inline __u16 ublk_pos_to_hwq(__u64 pos)
+{
+	return pos >> UBLK_BUFS_SIZE_BITS;
+}
+
+static inline __u32 ublk_pos_to_buf_offset(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
+}
+
+static inline __u16 ublk_pos_to_tag(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
+}
+
+/* offset of a single buffer, which has to be < UBLK_BUF_MAX_SIZE */
+static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
+{
+	return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) |
+		((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
+}
 
 /*
  * zero copy requires 4k block size, and can remap ublk driver's io
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
                   ` (3 preceding siblings ...)
  2023-02-10 15:32 ` [PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
@ 2023-02-10 21:54 ` Jens Axboe
  2023-02-10 22:19   ` Jens Axboe
  2023-02-11  5:13   ` Ming Lei
  2023-02-14 16:36 ` Stefan Hajnoczi
  5 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2023-02-10 21:54 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/10/23 8:32 AM, Ming Lei wrote:
> Hello,
> 
> Add two OPs which buffer is retrieved via kernel splice for supporting
> fuse/ublk zero copy.
> 
> The 1st patch enhances direct pipe & splice for moving pages in kernel,
> so that the two added OPs won't be misused, and avoid potential security
> hole.
> 
> The 2nd patch allows splice_direct_to_actor() caller to ignore signal
> if the actor won't block and can be done in bound time.
> 
> The 3rd patch add the two OPs.
> 
> The 4th patch implements ublk's ->splice_read() for supporting
> zero copy.
> 
> ublksrv(userspace):
> 
> https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
>     
> So far, only loop/null target implements zero copy in above branch:
>     
> 	ublk add -t loop -f $file -z
> 	ublk add -t none -z
> 
> Basic FS/IO function is verified, mount/kernel building & fio
> works fine, and big chunk IO(BS: 64k/512k) performance gets improved
> obviously.

Do you have any performance numbers? Also curious on liburing regression
tests, would be nice to see as it helps with review.

Caveat - haven't looked into this in detail just yet.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
@ 2023-02-10 22:19   ` Jens Axboe
  2023-02-11  5:13   ` Ming Lei
  1 sibling, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2023-02-10 22:19 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/10/23 2:54 PM, Jens Axboe wrote:
> On 2/10/23 8:32 AM, Ming Lei wrote:
>> Hello,
>>
>> Add two OPs which buffer is retrieved via kernel splice for supporting
>> fuse/ublk zero copy.
>>
>> The 1st patch enhances direct pipe & splice for moving pages in kernel,
>> so that the two added OPs won't be misused, and avoid potential security
>> hole.
>>
>> The 2nd patch allows splice_direct_to_actor() caller to ignore signal
>> if the actor won't block and can be done in bound time.
>>
>> The 3rd patch add the two OPs.
>>
>> The 4th patch implements ublk's ->splice_read() for supporting
>> zero copy.
>>
>> ublksrv(userspace):
>>
>> https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
>>     
>> So far, only loop/null target implements zero copy in above branch:
>>     
>> 	ublk add -t loop -f $file -z
>> 	ublk add -t none -z
>>
>> Basic FS/IO function is verified, mount/kernel building & fio
>> works fine, and big chunk IO(BS: 64k/512k) performance gets improved
>> obviously.
> 
> Do you have any performance numbers? Also curious on liburing regression
> tests, would be nice to see as it helps with review.
> 
> Caveat - haven't looked into this in detail just yet.

Also see the recent splice/whatever discussion, might be something
that's relevant here, particularly if we can avoid splice:

https://lore.kernel.org/io-uring/0cfd9f02-dea7-90e2-e932-c8129b6013c7@samba.org/

It's long...

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
  2023-02-10 22:19   ` Jens Axboe
@ 2023-02-11  5:13   ` Ming Lei
  2023-02-11 15:45     ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-11  5:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Fri, Feb 10, 2023 at 02:54:29PM -0700, Jens Axboe wrote:
> On 2/10/23 8:32 AM, Ming Lei wrote:
> > Hello,
> > 
> > Add two OPs which buffer is retrieved via kernel splice for supporting
> > fuse/ublk zero copy.
> > 
> > The 1st patch enhances direct pipe & splice for moving pages in kernel,
> > so that the two added OPs won't be misused, and avoid potential security
> > hole.
> > 
> > The 2nd patch allows splice_direct_to_actor() caller to ignore signal
> > if the actor won't block and can be done in bound time.
> > 
> > The 3rd patch add the two OPs.
> > 
> > The 4th patch implements ublk's ->splice_read() for supporting
> > zero copy.
> > 
> > ublksrv(userspace):
> > 
> > https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
> >     
> > So far, only loop/null target implements zero copy in above branch:
> >     
> > 	ublk add -t loop -f $file -z
> > 	ublk add -t none -z
> > 
> > Basic FS/IO function is verified, mount/kernel building & fio
> > works fine, and big chunk IO(BS: 64k/512k) performance gets improved
> > obviously.
> 
> Do you have any performance numbers?

A simple test of ublk-loop over an image on btrfs shows an improvement of
100% ~ 200%.

> Also curious on liburing regression
> tests, would be nice to see as it helps with review.

It isn't easy, since so far it requires a ublk device; the operation
basically looks like reading to / writing from one device buffer.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
@ 2023-02-11 15:42   ` Ming Lei
  2023-02-11 18:57     ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-11 15:42 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, Linus Torvalds

On Fri, Feb 10, 2023 at 11:32:09PM +0800, Ming Lei wrote:
> Per-task direct pipe can transfer page between two files or one
> file and other kernel components, especially by splice_direct_to_actor
> and __splice_from_pipe().
> 
> This way is helpful for fuse/ublk to implement zero copy by transferring
> pages from device to file or socket. However, when device's ->splice_read()
> produces pages, the kernel consumer may read from or write to these pages,
> and from device viewpoint, there could be unexpected read or write on
> pages.
> 
> Enhance the limit by the following approach:
> 
> 1) add kernel splice flags of SPLICE_F_KERN_FOR_[READ|WRITE] which is
>    passed to device's ->read_splice(), then device can check if this
>    READ or WRITE is expected on pages filled to pipe together with
>    information from ppos & len
> 
> 2) add kernel splice flag of SPLICE_F_KERN_NEED_CONFIRM which is passed
>    to device's ->read_splice() for asking device to confirm if it
>    really supports this kind of usage of feeding pages by ->read_splice().
>    If device does support, pipe->ack_page_consuming is set. This way
>    can avoid misuse.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  fs/splice.c               | 15 +++++++++++++++
>  include/linux/pipe_fs_i.h | 10 ++++++++++
>  include/linux/splice.h    | 22 ++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 87d9b19349de..c4770e1644cc 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -792,6 +792,14 @@ static long do_splice_to(struct file *in, loff_t *ppos,
>  	return in->f_op->splice_read(in, ppos, pipe, len, flags);
>  }
>  
> +static inline bool slice_read_acked(const struct pipe_inode_info *pipe,
> +			       int flags)
> +{
> +	if (flags & SPLICE_F_KERN_NEED_CONFIRM)
> +		return pipe->ack_page_consuming;
> +	return true;
> +}
> +
>  /**
>   * splice_direct_to_actor - splices data directly between two non-pipes
>   * @in:		file to splice from
> @@ -861,10 +869,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
>  		size_t read_len;
>  		loff_t pos = sd->pos, prev_pos = pos;
>  
> +		pipe->ack_page_consuming = false;
>  		ret = do_splice_to(in, &pos, pipe, len, flags);
>  		if (unlikely(ret <= 0))
>  			goto out_release;
>  
> +		if (!slice_read_acked(pipe, flags)) {
> +			bytes = 0;
> +			ret = -EACCES;
> +			goto out_release;
> +		}
> +
>  		read_len = ret;
>  		sd->total_len = read_len;
>  
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 6cb65df3e3ba..09ee1a9380ec 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -72,6 +72,7 @@ struct pipe_inode_info {
>  	unsigned int r_counter;
>  	unsigned int w_counter;
>  	bool poll_usage;
> +	bool ack_page_consuming;	/* only for direct pipe */
>  	struct page *tmp_page;
>  	struct fasync_struct *fasync_readers;
>  	struct fasync_struct *fasync_writers;
> @@ -218,6 +219,15 @@ static inline void pipe_discard_from(struct pipe_inode_info *pipe,
>  		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
>  }
>  
> +/*
> + * Called in ->splice_read() for confirming the READ/WRITE page is allowed
> + */
> +static inline void pipe_ack_page_consume(struct pipe_inode_info *pipe)
> +{
> +	if (!WARN_ON_ONCE(current->splice_pipe != pipe))
> +		pipe->ack_page_consuming = true;
> +}
> +
>  /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
>     memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
>  #define PIPE_SIZE		PAGE_SIZE
> diff --git a/include/linux/splice.h b/include/linux/splice.h
> index a55179fd60fc..98c471fd918d 100644
> --- a/include/linux/splice.h
> +++ b/include/linux/splice.h
> @@ -23,6 +23,28 @@
>  
>  #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
>  
> +/*
> + * Flags used for kernel internal page move from ->splice_read()
> + * by internal direct pipe, and user pipe can't touch these
> + * flags.
> + *
> + * Pages filled from ->splice_read() are usually moved/copied to
> + * ->splice_write(). Here address fuse/ublk zero copy by transferring
> + * page from device to file/socket for either READ or WRITE. So we
> + * need ->splice_read() to confirm if this READ/WRITE is allowed on
> + * pages filled in ->splice_read().
> + */
> +/* The page consumer is for READ from pages moved from direct pipe */
> +#define SPLICE_F_KERN_FOR_READ	(0x100)
> +/* The page consumer is for WRITE to pages moved from direct pipe */
> +#define SPLICE_F_KERN_FOR_WRITE	(0x200)
> +/*
> + * ->splice_read() has to confirm if consumer's READ/WRITE on pages
> + * is allow. If yes, ->splice_read() has to set pipe->ack_page_consuming,
> + * otherwise pipe->ack_page_consuming has to be cleared.
> + */
> +#define SPLICE_F_KERN_NEED_CONFIRM	(0x400)
> +

As Linus commented in another thread, this patch is really ugly.

I'd suggest changing it to the following one; any comments are welcome!


From fb9340ce72a1c58c9428d2af7cb00b55fa4ba799 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Fri, 10 Feb 2023 09:16:46 +0000
Subject: [PATCH 2/4] fs/splice: add private flags for cross-check in two ends

Pages are transferred via pipe/splice between different subsystems.

The source subsystem may know whether the spliced pages are readable or
writable. The sink subsystem may know whether the spliced pages need to
be written to or read from.

Add two pairs of private flags, so that the source subsystem and the
sink subsystem can cross-check that the pages are used correctly; they
are supposed to be used only when communicating over the direct pipe.
Generic splice and pipe code is not supposed to touch these private
flags.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/pipe_fs_i.h | 8 ++++++++
 include/linux/splice.h    | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 6cb65df3e3ba..959c8b9287f4 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,14 @@
 #define PIPE_BUF_FLAG_LOSS	0x40	/* Message loss happened after this buffer */
 #endif
 
+/*
+ * Used by source/sink end only, don't touch them in generic
+ * splice/pipe code. Set in source side, and check in sink
+ * side
+ */
+#define PIPE_BUF_PRIV_FLAG_MAY_READ	0x1000 /* sink can read from page */
+#define PIPE_BUF_PRIV_FLAG_MAY_WRITE	0x2000 /* sink can write to page */
+
 /**
  *	struct pipe_buffer - a linux kernel pipe buffer
  *	@page: the page containing the data for the pipe buffer
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..90d1d2b5327d 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,15 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Use for source/sink side only, don't touch them in generic splice/pipe
+ * code. Pass from sink side, and check in source side.
+ *
+ * So far, it is only for communicating over direct pipe.
+ */
+#define SPLICE_F_PRIV_FOR_READ	(0x100)	/* sink side will read from page */
+#define SPLICE_F_PRIV_FOR_WRITE	(0x200) /* sink side will write page */
+
 /*
  * Passed to the actors
  */
-- 
2.38.1



Thanks,
Ming


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
@ 2023-02-11 15:45   ` Jens Axboe
  2023-02-11 16:12     ` Ming Lei
  2023-02-11 17:13   ` Jens Axboe
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2023-02-11 15:45 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/10/23 8:32 AM, Ming Lei wrote:
> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> for building buffer.
> 
> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> for building buffer.
> 
> The typical use case is for supporting ublk/fuse io_uring zero copy,
> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
> from device->read_splice(), then READ/WRITE can be done to/from this
> buffer directly.

Main question here - would this be better not plumbed up through the rw
path? Might be cleaner, even if it either requires a bit of helper
refactoring or accepting a bit of duplication. But would still be better
than polluting the rw fast path imho.

Also seems like this should be separately testable. We can't add new
opcodes that don't have a feature test at least, and should also have
various corner case tests. A bit of commenting outside of this below.

> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 5238ecd7af6a..91e8d8f96134 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
>  		.prep			= io_eopnotsupp_prep,
>  #endif
>  	},
> +	[IORING_OP_READ_SPLICE_BUF] = {
> +		.needs_file		= 1,
> +		.unbound_nonreg_file	= 1,
> +		.pollin			= 1,
> +		.plug			= 1,
> +		.audit_skip		= 1,
> +		.ioprio			= 1,
> +		.iopoll			= 1,
> +		.iopoll_queue		= 1,
> +		.prep			= io_prep_rw,
> +		.issue			= io_read,
> +	},
> +	[IORING_OP_WRITE_SPLICE_BUF] = {
> +		.needs_file		= 1,
> +		.hash_reg_file		= 1,
> +		.unbound_nonreg_file	= 1,
> +		.pollout		= 1,
> +		.plug			= 1,
> +		.audit_skip		= 1,
> +		.ioprio			= 1,
> +		.iopoll			= 1,
> +		.iopoll_queue		= 1,
> +		.prep			= io_prep_rw,
> +		.issue			= io_write,
> +	},

Are these really safe with iopoll?

> +static int io_prep_rw_splice_buf(struct io_kiocb *req,
> +				 const struct io_uring_sqe *sqe)
> +{
> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
> +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
> +	struct io_rw_splice_buf_data data;
> +	struct io_mapped_ubuf *imu;
> +	struct fd splice_fd;
> +	int ret;
> +
> +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
> +	if (!splice_fd.file)
> +		return -EBADF;

Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
io_file_get_normal() for the non-fixed case in case someone passed in an
io_uring fd.

> +	data.imu = &imu;
> +
> +	rw->addr = 0;
> +	req->flags |= REQ_F_NEED_CLEANUP;
> +
> +	ret = __io_prep_rw_splice_buf(req, &data, splice_fd.file, rw->len,
> +			splice_off);
> +	imu = *data.imu;
> +	imu->acct_pages = 0;
> +	imu->ubuf = 0;
> +	imu->ubuf_end = data.total;
> +	rw->len = data.total;
> +	req->imu = imu;
> +	if (!data.total) {
> +		io_rw_cleanup_splice_buf(req);
> +	} else  {
> +		ret = 0;
> +	}
> +out_put_fd:
> +	if (splice_fd.file)
> +		fdput(splice_fd);
> +
> +	return ret;
> +}

If the operation is done, clear NEED_CLEANUP and do the cleanup here?
That'll be faster.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-11  5:13   ` Ming Lei
@ 2023-02-11 15:45     ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2023-02-11 15:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/10/23 10:13 PM, Ming Lei wrote:
> On Fri, Feb 10, 2023 at 02:54:29PM -0700, Jens Axboe wrote:
>> On 2/10/23 8:32 AM, Ming Lei wrote:
>>> Hello,
>>>
>>> Add two OPs which buffer is retrieved via kernel splice for supporting
>>> fuse/ublk zero copy.
>>>
>>> The 1st patch enhances direct pipe & splice for moving pages in kernel,
>>> so that the two added OPs won't be misused, and avoid potential security
>>> hole.
>>>
>>> The 2nd patch allows splice_direct_to_actor() caller to ignore signal
>>> if the actor won't block and can be done in bound time.
>>>
>>> The 3rd patch add the two OPs.
>>>
>>> The 4th patch implements ublk's ->splice_read() for supporting
>>> zero copy.
>>>
>>> ublksrv(userspace):
>>>
>>> https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
>>>     
>>> So far, only loop/null target implements zero copy in above branch:
>>>     
>>> 	ublk add -t loop -f $file -z
>>> 	ublk add -t none -z
>>>
>>> Basic FS/IO function is verified, mount/kernel building & fio
>>> works fine, and big chunk IO(BS: 64k/512k) performance gets improved
>>> obviously.
>>
>> Do you have any performance numbers?
> 
> Simple test on ublk-loop over image in btrfs shows the improvement is
> 100% ~ 200%.

That is pretty tasty...

>> Also curious on liburing regression
>> tests, would be nice to see as it helps with review.
> 
> It isn't easy since it requires ublk device so far, it looks like
> read to/write from one device buffer.

It can't be tested without ublk itself? Surely the two newly added ops can
have separate test cases?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-11 15:45   ` Jens Axboe
@ 2023-02-11 16:12     ` Ming Lei
  2023-02-11 16:52       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-11 16:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote:
> On 2/10/23 8:32 AM, Ming Lei wrote:
> > IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
> > ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> > for building buffer.
> > 
> > IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
> > ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> > for building buffer.
> > 
> > The typical use case is for supporting ublk/fuse io_uring zero copy,
> > and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
> > from device->read_splice(), then READ/WRITE can be done to/from this
> > buffer directly.
> 
> Main question here - would this be better not plumbed up through the rw
> path? Might be cleaner, even if it either requires a bit of helper
> refactoring or accepting a bit of duplication. But would still be better
> than polluting the rw fast path imho.

The buffer is actually an IO buffer, which has to be plumbed up in the IO
path, and it can't be handled like the registered buffer.

The only effect on the fast path is:

		if (io_rw_splice_buf(req))	/* which just checks the opcode */
			return io_prep_rw_splice_buf(req, sqe);

and the cleanup code, which is only done for the two new OPs.

Or maybe I misunderstand your point? Or any detailed suggestion?

Actually the code should be factored into generic helpers, since net.c
needs to use them too. Probably it should move to rsrc.c?

> 
> Also seems like this should be separately testable. We can't add new
> opcodes that don't have a feature test at least, and should also have
> various corner case tests. A bit of commenting outside of this below.

OK, I will write/add one very simple ublk userspace program to liburing
for test purposes.

> 
> > diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> > index 5238ecd7af6a..91e8d8f96134 100644
> > --- a/io_uring/opdef.c
> > +++ b/io_uring/opdef.c
> > @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
> >  		.prep			= io_eopnotsupp_prep,
> >  #endif
> >  	},
> > +	[IORING_OP_READ_SPLICE_BUF] = {
> > +		.needs_file		= 1,
> > +		.unbound_nonreg_file	= 1,
> > +		.pollin			= 1,
> > +		.plug			= 1,
> > +		.audit_skip		= 1,
> > +		.ioprio			= 1,
> > +		.iopoll			= 1,
> > +		.iopoll_queue		= 1,
> > +		.prep			= io_prep_rw,
> > +		.issue			= io_read,
> > +	},
> > +	[IORING_OP_WRITE_SPLICE_BUF] = {
> > +		.needs_file		= 1,
> > +		.hash_reg_file		= 1,
> > +		.unbound_nonreg_file	= 1,
> > +		.pollout		= 1,
> > +		.plug			= 1,
> > +		.audit_skip		= 1,
> > +		.ioprio			= 1,
> > +		.iopoll			= 1,
> > +		.iopoll_queue		= 1,
> > +		.prep			= io_prep_rw,
> > +		.issue			= io_write,
> > +	},
> 
> Are these really safe with iopoll?

Yeah, after the buffer is built, the handling is basically the same as
IORING_OP_WRITE_FIXED, so I think it is safe.

> 
> > +static int io_prep_rw_splice_buf(struct io_kiocb *req,
> > +				 const struct io_uring_sqe *sqe)
> > +{
> > +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> > +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
> > +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
> > +	struct io_rw_splice_buf_data data;
> > +	struct io_mapped_ubuf *imu;
> > +	struct fd splice_fd;
> > +	int ret;
> > +
> > +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
> > +	if (!splice_fd.file)
> > +		return -EBADF;
> 
> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> io_file_get_normal() for the non-fixed case in case someone passed in an
> io_uring fd.

SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags; if we
can use sqe->addr3, I think it is doable.

> 
> > +	data.imu = &imu;
> > +
> > +	rw->addr = 0;
> > +	req->flags |= REQ_F_NEED_CLEANUP;
> > +
> > +	ret = __io_prep_rw_splice_buf(req, &data, splice_fd.file, rw->len,
> > +			splice_off);
> > +	imu = *data.imu;
> > +	imu->acct_pages = 0;
> > +	imu->ubuf = 0;
> > +	imu->ubuf_end = data.total;
> > +	rw->len = data.total;
> > +	req->imu = imu;
> > +	if (!data.total) {
> > +		io_rw_cleanup_splice_buf(req);
> > +	} else  {
> > +		ret = 0;
> > +	}
> > +out_put_fd:
> > +	if (splice_fd.file)
> > +		fdput(splice_fd);
> > +
> > +	return ret;
> > +}
> 
> If the operation is done, clear NEED_CLEANUP and do the cleanup here?
> That'll be faster.

The buffer has to be cleaned up after the req is completed, since the bvec
table is needed by the bio, and the page references need to be dropped
after the IO is done too.


thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-11 16:12     ` Ming Lei
@ 2023-02-11 16:52       ` Jens Axboe
  2023-02-12  3:22         ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2023-02-11 16:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/11/23 9:12 AM, Ming Lei wrote:
> On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote:
>> On 2/10/23 8:32 AM, Ming Lei wrote:
>>> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
>>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
>>> for building buffer.
>>>
>>> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
>>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
>>> for building buffer.
>>>
>>> The typical use case is for supporting ublk/fuse io_uring zero copy,
>>> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
>>> from device->read_splice(), then READ/WRITE can be done to/from this
>>> buffer directly.
>>
>> Main question here - would this be better not plumbed up through the rw
>> path? Might be cleaner, even if it either requires a bit of helper
>> refactoring or accepting a bit of duplication. But would still be better
>> than polluting the rw fast path imho.
> 
> The buffer is actually IO buffer, which has to be plumbed up in IO path,
> and it can't be done like the registered buffer.
> 
> The only affect on fast path is :
> 
> 		if (io_rw_splice_buf(req))	//which just check opcode
>               return io_prep_rw_splice_buf(req, sqe);
> 
> and the cleanup code which is only done for the two new OPs.
> 
> Or maybe I misunderstand your point? Or any detailed suggestion?
> 
> Actually the code should be factored into generic helper, since net.c
> need to use them too. Probably it needs to move to rsrc.c?

Yep, just refactoring out those bits as a prep thing. rsrc could work,
or perhaps a new file for that.

>> Also seems like this should be separately testable. We can't add new
>> opcodes that don't have a feature test at least, and should also have
>> various corner case tests. A bit of commenting outside of this below.
> 
> OK, I will write/add one very simple ublk userspace to liburing for
> test purpose.

Thanks!

>>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>>> index 5238ecd7af6a..91e8d8f96134 100644
>>> --- a/io_uring/opdef.c
>>> +++ b/io_uring/opdef.c
>>> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
>>>  		.prep			= io_eopnotsupp_prep,
>>>  #endif
>>>  	},
>>> +	[IORING_OP_READ_SPLICE_BUF] = {
>>> +		.needs_file		= 1,
>>> +		.unbound_nonreg_file	= 1,
>>> +		.pollin			= 1,
>>> +		.plug			= 1,
>>> +		.audit_skip		= 1,
>>> +		.ioprio			= 1,
>>> +		.iopoll			= 1,
>>> +		.iopoll_queue		= 1,
>>> +		.prep			= io_prep_rw,
>>> +		.issue			= io_read,
>>> +	},
>>> +	[IORING_OP_WRITE_SPLICE_BUF] = {
>>> +		.needs_file		= 1,
>>> +		.hash_reg_file		= 1,
>>> +		.unbound_nonreg_file	= 1,
>>> +		.pollout		= 1,
>>> +		.plug			= 1,
>>> +		.audit_skip		= 1,
>>> +		.ioprio			= 1,
>>> +		.iopoll			= 1,
>>> +		.iopoll_queue		= 1,
>>> +		.prep			= io_prep_rw,
>>> +		.issue			= io_write,
>>> +	},
>>
>> Are these really safe with iopoll?
> 
> Yeah, after the buffer is built, the handling is basically
> same with IORING_OP_WRITE_FIXED, so I think it is safe.

Yeah, on a second look, since these are just using the normal read/write
path after that, this should indeed be fine.

>>
>>> +static int io_prep_rw_splice_buf(struct io_kiocb *req,
>>> +				 const struct io_uring_sqe *sqe)
>>> +{
>>> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>> +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
>>> +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
>>> +	struct io_rw_splice_buf_data data;
>>> +	struct io_mapped_ubuf *imu;
>>> +	struct fd splice_fd;
>>> +	int ret;
>>> +
>>> +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
>>> +	if (!splice_fd.file)
>>> +		return -EBADF;
>>
>> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
>> io_file_get_normal() for the non-fixed case in case someone passed in an
>> io_uring fd.
> 
> SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> we can use sqe->addr3, I think it is doable.

I haven't checked the rest, but you can't just use ->splice_flags for
this?

In any case, the get path needs to look like io_tee() here, and:

>>> +out_put_fd:
>>> +	if (splice_fd.file)
>>> +		fdput(splice_fd);

this put needs to be gated on whether it's a fixed file or not.
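
Roughly something like this (a sketch only, modeled on the io_tee()/io_splice()
lookup; the helper names and the 'fixed' out-parameter are illustrative
assumptions, not existing io_uring code):

static struct file *io_get_splice_src(struct io_kiocb *req, int splice_fd,
				      unsigned int splice_flags,
				      unsigned int issue_flags, bool *fixed)
{
	*fixed = splice_flags & SPLICE_F_FD_IN_FIXED;

	if (*fixed)
		return io_file_get_fixed(req, splice_fd, issue_flags);
	return io_file_get_normal(req, splice_fd);
}

/* and on the release side, only drop a reference that was actually taken */
static void io_put_splice_src(struct file *file, bool fixed)
{
	if (file && !fixed)
		fput(file);
}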

>> If the operation is done, clear NEED_CLEANUP and do the cleanup here?
>> That'll be faster.
> 
> The buffer has to be cleaned up after req is completed, since bvec
> table is needed for bio, and page reference need to be dropped after
> IO is done too.

I mean when you clear that flag, call the cleanup bits you otherwise
would've called on later cleanup.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
  2023-02-11 15:45   ` Jens Axboe
@ 2023-02-11 17:13   ` Jens Axboe
  2023-02-12  1:48     ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2023-02-11 17:13 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block, linux-fsdevel, Alexander Viro
  Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/10/23 8:32 AM, Ming Lei wrote:

One more comment on this.

> +static int __io_prep_rw_splice_buf(struct io_kiocb *req,
> +				   struct io_rw_splice_buf_data *data,
> +				   struct file *splice_f,
> +				   size_t len,
> +				   loff_t splice_off)
> +{
> +	unsigned flags = req->opcode == IORING_OP_READ_SPLICE_BUF ?
> +			SPLICE_F_KERN_FOR_READ : SPLICE_F_KERN_FOR_WRITE;
> +	struct splice_desc sd = {
> +		.total_len = len,
> +		.flags = flags | SPLICE_F_NONBLOCK | SPLICE_F_KERN_NEED_CONFIRM,
> +		.pos = splice_off,
> +		.u.data = data,
> +		.ignore_sig = true,
> +	};
> +
> +	return splice_direct_to_actor(splice_f, &sd,
> +			io_splice_buf_direct_actor);

Is this safe? We end up using current->splice_pipe here, which should be
fine as long as things are left in a sane state after every operation.
Which they should be, just like a syscall would. Just wanted to make
sure you've considered that part.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-11 15:42   ` Ming Lei
@ 2023-02-11 18:57     ` Linus Torvalds
  2023-02-12  1:39       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2023-02-11 18:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On Sat, Feb 11, 2023 at 7:42 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> +/*
> + * Used by source/sink end only, don't touch them in generic
> + * splice/pipe code. Set in source side, and check in sink
> + * side
> + */
> +#define PIPE_BUF_PRIV_FLAG_MAY_READ    0x1000 /* sink can read from page */
> +#define PIPE_BUF_PRIV_FLAG_MAY_WRITE   0x2000 /* sink can write to page */
> +

So this sounds much more sane and understandable, but I have two worries:

 (a) what's the point of MAY_READ? A non-readable page sounds insane
and wrong. All sinks expect to be able to read.

 (b) what about 'tee()' which duplicates a pipe buffer that has
MAY_WRITE? Particularly one that may already have been *partially*
given to something that thinks it can write to it?

So at a minimum I think the tee code then needs to clear that flag.
And I think MAY_READ is nonsense.

          Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-11 18:57     ` Linus Torvalds
@ 2023-02-12  1:39       ` Ming Lei
  2023-02-13 20:04         ` Linus Torvalds
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-12  1:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Sat, Feb 11, 2023 at 10:57:08AM -0800, Linus Torvalds wrote:
> On Sat, Feb 11, 2023 at 7:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > +/*
> > + * Used by source/sink end only, don't touch them in generic
> > + * splice/pipe code. Set in source side, and check in sink
> > + * side
> > + */
> > +#define PIPE_BUF_PRIV_FLAG_MAY_READ    0x1000 /* sink can read from page */
> > +#define PIPE_BUF_PRIV_FLAG_MAY_WRITE   0x2000 /* sink can write to page */
> > +
> 
> So this sounds much more sane and understandable, but I have two worries:
> 
>  (a) what's the point of MAY_READ? A non-readable page sounds insane
> and wrong. All sinks expect to be able to read.

For example, it can be a page which the sink end needs to fill with data,
so we needn't zero it at the source end every time; the flag is just for
avoiding leaking kernel data if an (unexpected) sink end simply tried to
read from the spliced page instead of writing data to it.

> 
>  (b) what about 'tee()' which duplicates a pipe buffer that has
> MAY_WRITE? Particularly one that may already have been *partially*
> given to something that thinks it can write to it?

The added flags are for kernel code (io_uring) over the direct pipe,
and the two pipe buf flags won't be copied across tee(), because the
kernel code just uses the spliced pages. (io_uring -> tee())

Also, because of the added SPLICE_F_PRIV_FOR_READ[WRITE], the pipe buffer
flags won't be copied over tee() either, because tee() can't pass the two
flags to the device's ->splice_read().

We may need to document that the two pairs of flags should be used
together.

thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-11 17:13   ` Jens Axboe
@ 2023-02-12  1:48     ` Ming Lei
  2023-02-12  2:42       ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-12  1:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Sat, Feb 11, 2023 at 10:13:37AM -0700, Jens Axboe wrote:
> On 2/10/23 8:32?AM, Ming Lei wrote:
> 
> One more comment on this.
> 
> > +static int __io_prep_rw_splice_buf(struct io_kiocb *req,
> > +				   struct io_rw_splice_buf_data *data,
> > +				   struct file *splice_f,
> > +				   size_t len,
> > +				   loff_t splice_off)
> > +{
> > +	unsigned flags = req->opcode == IORING_OP_READ_SPLICE_BUF ?
> > +			SPLICE_F_KERN_FOR_READ : SPLICE_F_KERN_FOR_WRITE;
> > +	struct splice_desc sd = {
> > +		.total_len = len,
> > +		.flags = flags | SPLICE_F_NONBLOCK | SPLICE_F_KERN_NEED_CONFIRM,
> > +		.pos = splice_off,
> > +		.u.data = data,
> > +		.ignore_sig = true,
> > +	};
> > +
> > +	return splice_direct_to_actor(splice_f, &sd,
> > +			io_splice_buf_direct_actor);
> 
> Is this safe? We end up using current->splice_pipe here, which should be
> fine as long as things are left in a sane state after every operation.
> Which they should be, just like a syscall would. Just wanted to make
> sure you've considered that part.

Yeah.

The direct pipe is always left empty when splice_direct_to_actor()
returns. Pipe buffers (pages) are produced by ->splice_read(),
called from splice_direct_to_actor(), and consumed by
io_splice_buf_direct_actor().

If any error is returned, the direct pipe is empty too, and we just
need to drop the references on the spliced pages via
io_rw_cleanup_splice_buf().
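
So the flow is roughly the following (a sketch only; the
io_rw_splice_buf_data fields and the actor bodies are illustrative
assumptions, not the patch code):

static int io_splice_buf_actor(struct pipe_inode_info *pipe,
			       struct pipe_buffer *buf,
			       struct splice_desc *sd)
{
	struct io_rw_splice_buf_data *data = sd->u.data;
	struct bio_vec *bv = &data->imu->bvec[data->nr_bvecs++];

	/* keep the page alive after the pipe buffer is released */
	get_page(buf->page);
	bv->bv_page = buf->page;
	bv->bv_offset = buf->offset;
	bv->bv_len = buf->len;

	return buf->len;
}

static int io_splice_buf_direct_actor(struct pipe_inode_info *pipe,
				      struct splice_desc *sd)
{
	/* drain whatever ->splice_read() just produced into the direct pipe */
	return __splice_from_pipe(pipe, sd, io_splice_buf_actor);
}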


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-12  1:48     ` Ming Lei
@ 2023-02-12  2:42       ` Jens Axboe
  0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2023-02-12  2:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/11/23 6:48 PM, Ming Lei wrote:
> On Sat, Feb 11, 2023 at 10:13:37AM -0700, Jens Axboe wrote:
>> On 2/10/23 8:32?AM, Ming Lei wrote:
>>
>> One more comment on this.
>>
>>> +static int __io_prep_rw_splice_buf(struct io_kiocb *req,
>>> +				   struct io_rw_splice_buf_data *data,
>>> +				   struct file *splice_f,
>>> +				   size_t len,
>>> +				   loff_t splice_off)
>>> +{
>>> +	unsigned flags = req->opcode == IORING_OP_READ_SPLICE_BUF ?
>>> +			SPLICE_F_KERN_FOR_READ : SPLICE_F_KERN_FOR_WRITE;
>>> +	struct splice_desc sd = {
>>> +		.total_len = len,
>>> +		.flags = flags | SPLICE_F_NONBLOCK | SPLICE_F_KERN_NEED_CONFIRM,
>>> +		.pos = splice_off,
>>> +		.u.data = data,
>>> +		.ignore_sig = true,
>>> +	};
>>> +
>>> +	return splice_direct_to_actor(splice_f, &sd,
>>> +			io_splice_buf_direct_actor);
>>
>> Is this safe? We end up using current->splice_pipe here, which should be
>> fine as long as things are left in a sane state after every operation.
>> Which they should be, just like a syscall would. Just wanted to make
>> sure you've considered that part.
> 
> Yeah.
> 
> Direct pipe is always left as empty when splice_direct_to_actor()
> returns. Pipe buffers(pages) are produced from ->splice_read()
> called from splice_direct_to_actor(), and consumed by
> io_splice_buf_direct_actor().
> 
> If any error is returned, direct pipe is empty too, and we just
> need to drop reference of sliced pages by io_rw_cleanup_splice_buf().

OK, thanks for confirming; then that should be fine, as we obviously
can't have two syscalls (or sendfile(2) and task_work from
io_uring) running at the same time.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-11 16:52       ` Jens Axboe
@ 2023-02-12  3:22         ` Ming Lei
  2023-02-12  3:55           ` Jens Axboe
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-12  3:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Sat, Feb 11, 2023 at 09:52:58AM -0700, Jens Axboe wrote:
> On 2/11/23 9:12?AM, Ming Lei wrote:
> > On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote:
> >> On 2/10/23 8:32?AM, Ming Lei wrote:
> >>> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from
> >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from
> >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len)
> >>> for building buffer.
> >>>
> >>> The typical use case is for supporting ublk/fuse io_uring zero copy,
> >>> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe
> >>> from device->read_splice(), then READ/WRITE can be done to/from this
> >>> buffer directly.
> >>
> >> Main question here - would this be better not plumbed up through the rw
> >> path? Might be cleaner, even if it either requires a bit of helper
> >> refactoring or accepting a bit of duplication. But would still be better
> >> than polluting the rw fast path imho.
> > 
> > The buffer is actually IO buffer, which has to be plumbed up in IO path,
> > and it can't be done like the registered buffer.
> > 
> > The only affect on fast path is :
> > 
> > 		if (io_rw_splice_buf(req))	//which just check opcode
> >               return io_prep_rw_splice_buf(req, sqe);
> > 
> > and the cleanup code which is only done for the two new OPs.
> > 
> > Or maybe I misunderstand your point? Or any detailed suggestion?
> > 
> > Actually the code should be factored into generic helper, since net.c
> > need to use them too. Probably it needs to move to rsrc.c?
> 
> Yep, just refactoring out those bits as a prep thing. rsrc could work,
> or perhaps a new file for that.

OK.

> 
> >> Also seems like this should be separately testable. We can't add new
> >> opcodes that don't have a feature test at least, and should also have
> >> various corner case tests. A bit of commenting outside of this below.
> > 
> > OK, I will write/add one very simple ublk userspace to liburing for
> > test purpose.
> 
> Thanks!

Thinking about it further, if we use ublk for liburing test purposes,
root is often needed; even though we support un-privileged mode, that
needs an administrator to grant access, so is it still good to do so?

It could be easier to add ->splice_read() to /dev/zero for test
purposes: just allocate zeroed pages in ->splice_read() and add
them to the pipe, like ublk's ->splice_read(), so the sink side can
read from or write to these pages, while zero's read_iter_zero() won't
be affected. And normal splice/tee won't connect to zero either,
because we only allow it for kernel use.
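
A minimal sketch of such a test-only ->splice_read() (assuming
add_to_pipe() and nosteal_pipe_buf_ops can be used here; error handling
is kept simple and this is not the actual patch):

static ssize_t zero_splice_read(struct file *in, loff_t *ppos,
				struct pipe_inode_info *pipe,
				size_t len, unsigned int flags)
{
	ssize_t total = 0;

	while (len) {
		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
		struct pipe_buffer buf;
		ssize_t ret;

		if (!page)
			return total ? total : -ENOMEM;

		buf = (struct pipe_buffer) {
			.ops	= &nosteal_pipe_buf_ops,
			.page	= page,
			.offset	= 0,
			.len	= min_t(size_t, len, PAGE_SIZE),
		};

		/* hand the zeroed page to the (direct) pipe for the sink side */
		ret = add_to_pipe(pipe, &buf);
		if (ret <= 0) {
			put_page(page);
			return total ? total : ret;
		}
		total += ret;
		len -= ret;
	}
	return total;
}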

> 
> >>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> >>> index 5238ecd7af6a..91e8d8f96134 100644
> >>> --- a/io_uring/opdef.c
> >>> +++ b/io_uring/opdef.c
> >>> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
> >>>  		.prep			= io_eopnotsupp_prep,
> >>>  #endif
> >>>  	},
> >>> +	[IORING_OP_READ_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollin			= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_read,
> >>> +	},
> >>> +	[IORING_OP_WRITE_SPLICE_BUF] = {
> >>> +		.needs_file		= 1,
> >>> +		.hash_reg_file		= 1,
> >>> +		.unbound_nonreg_file	= 1,
> >>> +		.pollout		= 1,
> >>> +		.plug			= 1,
> >>> +		.audit_skip		= 1,
> >>> +		.ioprio			= 1,
> >>> +		.iopoll			= 1,
> >>> +		.iopoll_queue		= 1,
> >>> +		.prep			= io_prep_rw,
> >>> +		.issue			= io_write,
> >>> +	},
> >>
> >> Are these really safe with iopoll?
> > 
> > Yeah, after the buffer is built, the handling is basically
> > same with IORING_OP_WRITE_FIXED, so I think it is safe.
> 
> Yeah, on a second look, as these are just using the normal read/write
> path after that should be fine indeed.
> 
> >>
> >>> +static int io_prep_rw_splice_buf(struct io_kiocb *req,
> >>> +				 const struct io_uring_sqe *sqe)
> >>> +{
> >>> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> >>> +	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
> >>> +	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
> >>> +	struct io_rw_splice_buf_data data;
> >>> +	struct io_mapped_ubuf *imu;
> >>> +	struct fd splice_fd;
> >>> +	int ret;
> >>> +
> >>> +	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
> >>> +	if (!splice_fd.file)
> >>> +		return -EBADF;
> >>
> >> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >> io_file_get_normal() for the non-fixed case in case someone passed in an
> >> io_uring fd.
> > 
> > SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> > we can use sqe->addr3, I think it is doable.
> 
> I haven't checked the rest, but you can't just use ->splice_flags for
> this?

->splice_flags shares memory with rwflags, so it can't be used.

I think it is fine to use ->addr3, given that io_getxattr()/io_setxattr()/
io_msg_ring() already use it.

> 
> In any case, the get path needs to look like io_tee() here, and:
> 
> >>> +out_put_fd:
> >>> +	if (splice_fd.file)
> >>> +		fdput(splice_fd);
> 
> this put needs to be gated on whether it's a fixed file or not.

Yeah.

> 
> >> If the operation is done, clear NEED_CLEANUP and do the cleanup here?
> >> That'll be faster.
> > 
> > The buffer has to be cleaned up after req is completed, since bvec
> > table is needed for bio, and page reference need to be dropped after
> > IO is done too.
> 
> I mean when you clear that flag, call the cleanup bits you otherwise
> would've called on later cleanup.

Got it.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-12  3:22         ` Ming Lei
@ 2023-02-12  3:55           ` Jens Axboe
  2023-02-13  1:06             ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2023-02-12  3:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On 2/11/23 8:22 PM, Ming Lei wrote:
>>>> Also seems like this should be separately testable. We can't add new
>>>> opcodes that don't have a feature test at least, and should also have
>>>> various corner case tests. A bit of commenting outside of this below.
>>>
>>> OK, I will write/add one very simple ublk userspace to liburing for
>>> test purpose.
>>
>> Thanks!
> 
> Thinking of further, if we use ublk for liburing test purpose, root is
> often needed, even though we support un-privileged mode, which needs
> administrator to grant access, so is it still good to do so?

That's fine, some tests already depend on root for certain things, like
passthrough. When I run the tests, I do a pass as both a regular user
and as root. The important bit is just that the tests skip when they are
not root rather than fail.
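
For the ublk-based test that could be as simple as a guard at the top of
the test program (a minimal sketch; exit code 77 as the harness' "skip"
value is an assumption here):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	if (geteuid() != 0) {
		fprintf(stderr, "skipping: ublk test requires root\n");
		return 77;	/* assumed "skip" exit code for the test harness */
	}

	/* ... set up a ublk device and exercise the new opcodes here ... */
	return 0;
}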

> It could be easier to add ->splice_read() on /dev/zero for test
> purpose, just allocate zeroed pages in ->splice_read(), and add
> them to pipe like ublk->splice_read(), and sink side can read
> from or write to these pages, but zero's read_iter_zero() won't
> be affected. And normal splice/tee won't connect to zero too
> because we only allow it from kernel use.

Arguably /dev/zero should still support splice_read() as a regression
fix as I argued to Linus, so I'd just add that as a prep patch.

>>>> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
>>>> io_file_get_normal() for the non-fixed case in case someone passed in an
>>>> io_uring fd.
>>>
>>> SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
>>> we can use sqe->addr3, I think it is doable.
>>
>> I haven't checked the rest, but you can't just use ->splice_flags for
>> this?
> 
> ->splice_flags shares memory with rwflags, so can't be used.
> 
> I think it is fine to use ->addr3, given io_getxattr()/io_setxattr()/
> io_msg_ring() has used that.

This is part of the confusion, as you treat it basically like a
read/write internally, but the opcode names indicate otherwise. Why
not just have a separate prep helper for these and then use a layout
that makes more sense? Surely rwflags aren't applicable for these
anyway. I think that'd make it a lot cleaner.

Yeah, addr3 could easily be used, but it makes for a really confusing
command structure when the command is kinda-read but also kinda-splice.
And it arguably makes more sense to treat it as the latter, as it takes
the two fds like splice.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-12  3:55           ` Jens Axboe
@ 2023-02-13  1:06             ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-13  1:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei, Linus Torvalds

On Sat, Feb 11, 2023 at 08:55:23PM -0700, Jens Axboe wrote:
> On 2/11/23 8:22?PM, Ming Lei wrote:
> >>>> Also seems like this should be separately testable. We can't add new
> >>>> opcodes that don't have a feature test at least, and should also have
> >>>> various corner case tests. A bit of commenting outside of this below.
> >>>
> >>> OK, I will write/add one very simple ublk userspace to liburing for
> >>> test purpose.
> >>
> >> Thanks!
> > 
> > Thinking of further, if we use ublk for liburing test purpose, root is
> > often needed, even though we support un-privileged mode, which needs
> > administrator to grant access, so is it still good to do so?
> 
> That's fine, some tests already depend on root for certain things, like
> passthrough. When I run the tests, I do a pass as both a regular user
> and as root. The important bit is just that the tests skip when they are
> not root rather than fail.

OK, that is nice! I am going to write a minimal ublk userspace tool,
which can be used in both liburing & blktests for test purposes.

> 
> > It could be easier to add ->splice_read() on /dev/zero for test
> > purpose, just allocate zeroed pages in ->splice_read(), and add
> > them to pipe like ublk->splice_read(), and sink side can read
> > from or write to these pages, but zero's read_iter_zero() won't
> > be affected. And normal splice/tee won't connect to zero too
> > because we only allow it from kernel use.
> 
> Arguably /dev/zero should still support splice_read() as a regression
> fix as I argued to Linus, so I'd just add that as a prep patch.

Understood.

> 
> >>>> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use
> >>>> io_file_get_normal() for the non-fixed case in case someone passed in an
> >>>> io_uring fd.
> >>>
> >>> SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if
> >>> we can use sqe->addr3, I think it is doable.
> >>
> >> I haven't checked the rest, but you can't just use ->splice_flags for
> >> this?
> > 
> > ->splice_flags shares memory with rwflags, so can't be used.
> > 
> > I think it is fine to use ->addr3, given io_getxattr()/io_setxattr()/
> > io_msg_ring() has used that.
> 
> This is part of the confusion, as you treat it basically like a
> read/write internally, but the opcode names indicate differently. Why
> not just have a separate prep helper for these and then use a layout

Looks like a separate prep is cleaner.

> that makes more sense,surely rwflags aren't applicable for these
> anyway? I think that'd make it a lot cleaner.
> 
> Yeah, addr3 could easily be used, but it's makes for a really confusing
> command structure when the command is kinda-read but also kinda-splice.
> And it arguable makes more sense to treat it as the latter, as it takes
> the two fds like splice.

Yeah, it can be thought of as one direct splice between a device and a
generic file, and I'll take a few more words to share the whole story here.

1) traditional splice is: 

file A(read file A to pipe buffer) -> pipe -> file B(write pipe buffer to file B)

which implements zero copy for 'cp file_A file_B'.

2) device splice (device -> pipe -> file)

If we only wanted 'cp device file', the current splice works just fine;
however, we do not have a syscall for handling the following more generic
scenario:

	dev (produce buffer to pipe) -> pipe -> file (consume buffer from pipe)

splice is meant for transferring pages over a pipe, so the above model
is reasonable. And we do have kernel APIs for doing the above via a direct
pipe: splice_direct_to_actor() & __splice_from_pipe().

That is basically what this patch does, and the consumer could be READ
or WRITE. The current splice syscall only supports a WRITE consumer
(write the pipe buffer to a file) at the pipe's read end.

It can be used for fuse to support READ zero copy (fuse write zero copy
has been supported since 2010, but read zero copy was never supported
because such a syscall is missing), and for supporting ublk zero copy.

Actually it can help to implement any "virt" driver, most of which just
talk with a file or socket, or anything which can be handled efficiently
in userspace.

The usage would be as follows, supposing the syscall is named dev_splice():

	1) "poll" device A for an incoming request
	- "poll" just represents one kernel/user communication; once it
	returns, there is a request to be peeked

	- the device is one virt device, exposed to userspace for
	receiving requests from userspace

	2) handle driver-specific logic
	   - it could be request mapping, such as a logical volume manager;
		 the logic can be complicated enough to require a btree to map
		 requests from A to underlying files, as in dm-thin or bcache

	   - it could be a network based device (nbd, ceph RBD, drbd, ...,
	   USB over IP, ...); there could be meta lookup over a socket,
	   send command/recv reply, crypto enc/dec, ...

	3) dev_splice(A, A_offset, len, B, B_offset, OP)
	- A is the device
	- B is one generic file (regular file, block device, socket, ...)
	- OP is the consumer operation (could just be read/write or net
	  recv/send)
	- A (produce buffer) -> direct pipe -> B (consume the buffer from
	the pipe by OP)

	4) after the above dev_splice() is completed, the request is
	completed, and a notification is sent to userspace

Now we have io_uring commands for handling 1) & 4) efficiently; the tests
over ublk have shown this point. If 3) can be supported, the
complicated driver/device logic in 2) can be moved to userspace.
Then the whole driver can be implemented efficiently in userspace, and
performance could even be better than a kernel driver [1][2].

The trouble is that when dev_splice() becomes much more generic, it is
harder to define it as a syscall. It could be easier with io_uring
compared with a splice() syscall, but it is still not easy enough:

- whether the sqe interface (for providing parameters) can be stable from
  the beginning, or can be extended easily in the future

- REQ_F_* flags have been used up

- it may cause some trouble inside the io_uring implementation, such as
how to convert to the other io_uring OP handling with the provided
consumer op code.

That is why I treated it as io_uring read/write from the beginning; the
other two could just be treated as net recv/send, and the only difference
is that the buffer is retrieved from the direct pipe via
splice(device, offset, len).
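
From userspace, issuing one of the new OPs would look roughly like this
with liburing (a sketch: the opcode comes from this series, and the use
of sqe->splice_fd_in / sqe->splice_off_in follows the commit message, so
treat the details as assumptions):

#include <liburing.h>

static int queue_write_splice_buf(struct io_uring *ring, int backing_fd,
				  __u64 off, int splice_fd,
				  __u64 splice_off, unsigned int len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EAGAIN;

	/* addr is unused: the buffer comes from splice_fd's ->splice_read() */
	io_uring_prep_rw(IORING_OP_WRITE_SPLICE_BUF, sqe, backing_fd,
			 NULL, len, off);
	sqe->splice_fd_in = splice_fd;
	sqe->splice_off_in = splice_off;

	return io_uring_submit(ring);
}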

So which type is better? Treating it as dev_splice() or as a specific
consumer OP? If we treat it as splice, is it better to define it as one
single generic OP, or as multiple OPs with each one mapped to a single
consumer OP?

I am open on the interface definition; any comment/suggestion is
highly welcome!

[1] https://lore.kernel.org/linux-block/Yza1u1KfKa7ycQm0@T590/
[2] https://lore.kernel.org/lkml/20230126040822.GA2858@1wt.eu/T/

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-12  1:39       ` Ming Lei
@ 2023-02-13 20:04         ` Linus Torvalds
  2023-02-14  0:52           ` Ming Lei
  2023-02-14 11:03           ` Miklos Szeredi
  0 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2023-02-13 20:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On Sat, Feb 11, 2023 at 5:39 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> >
> >  (a) what's the point of MAY_READ? A non-readable page sounds insane
> > and wrong. All sinks expect to be able to read.
>
> For example, it is one page which needs sink end to fill data, so
> we needn't to zero it in source end every time, just for avoiding
> leak kernel data if (unexpected)sink end simply tried to read from
> the spliced page instead of writing data to page.

I still don't understand.

A sink *reads* the data. It doesn't write the data.

There's no point trying to deal with "if unexpectedly doing crazy
things". If a sink writes the data, the sink is so unbelievably buggy
that it's not even funny.

And having two flags that you then say "have to be used together" is pointless.

It's not two different flags if you can't use them separately!

So I think your explanations are anything *but* explaining what you
want. They are just strange and not sensible.

Please explain to me in small words and simple sentences what it is
you want. And no, if the explanation is "the sink wants to write to
the buffer", then that's not an explanation, it's just insanity.

We *used* to have the concept of "gifting" the buffer explicitly to
the sink, so that the sink could - instead of reading from it - decide
to just use the whole buffer as-is long term. The idea was that the
buffer would literally be moved from the source to the destination,
ownership and all.

But if that's what you want, then it's not about "sink writes". It's
literally about the splice() wanting to move not just the data, but
the whole ownership of the buffer.

Anyway, I will NAK this as long as the explanations for what the
semantics are and what you want to do don't make sense. Right now they
don't.

              Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-13 20:04         ` Linus Torvalds
@ 2023-02-14  0:52           ` Ming Lei
  2023-02-14  2:35             ` Ming Lei
  2023-02-14 11:03           ` Miklos Szeredi
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-14  0:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang, ming.lei

On Mon, Feb 13, 2023 at 12:04:27PM -0800, Linus Torvalds wrote:
> On Sat, Feb 11, 2023 at 5:39 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > >
> > >  (a) what's the point of MAY_READ? A non-readable page sounds insane
> > > and wrong. All sinks expect to be able to read.
> >
> > For example, it is one page which needs sink end to fill data, so
> > we needn't to zero it in source end every time, just for avoiding
> > leak kernel data if (unexpected)sink end simply tried to read from
> > the spliced page instead of writing data to page.
> 
> I still don't understand.
> 
> A sink *reads* the data. It doesn't write the data.
> 
> There's no point trying to deal with "if unexpectedly doing crazy
> things". If a sink writes the data, the sinkm is so unbelievably buggy
> that it's not even funny.
> 
> And having two flags that you then say "have to be used together" is pointless.

Actually I think it is fine to use the pipe buffer flags separately:
if MAY_READ/MAY_WRITE is set at the source end, the sink side needs to
respect it. All current in-tree source ends actually imply both MAY_READ
& MAY_WRITE.

> It's not two different flags if you can't use them separately!
> 
> So I think your explanations are anything *but* explaining what you
> want. They are just strange and not sensible.
> 
> Please explain to me in small words and simple sentences what it is
> you want. And no, if the explanation is "the sink wants to write to
> the buffer", then that's not an explanation, it's just insanity.
> 
> We *used* to have the concept of "gifting" the buffer explicitly to
> the sink, so that the sink could - instead of reading from it - decide
> to just use the whole buffer as-is long term. The idea was that tthe
> buffer woudl literally be moved from the source to the destination,
> ownership and all.
> 
> But if that's what you want, then it's not about "sink writes". It's
> literally about the splice() wanting to move not just the data, but
> the whole ownership of the buffer.

Yeah, it is actually transferring the buffer ownership, and it looks like
SPLICE_F_GIFT is exactly this case, but the driver side needs to set
QUEUE_FLAG_STABLE_WRITES to prevent writeback from touching these pages.

The idea is as follows:

file (device such as fuse or ublk, produces the pipe buffer) -> direct pipe -> file (consumes the pipe buffer)

The 'consume' could be READ or WRITE.

So once SPLICE_F_GIFT is set from the source side, the two buffer flags
aren't needed any more, right?

Please see the detailed explanation & use case in the following link:

https://lore.kernel.org/linux-block/409656a0-7db5-d87c-3bb2-c05ff7af89af@kernel.dk/T/#m237e5973571b3d85df9fa519cf2c9762440009ba



Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-14  0:52           ` Ming Lei
@ 2023-02-14  2:35             ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2023-02-14  2:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
	Christoph Hellwig, Ziyang Zhang

On Tue, Feb 14, 2023 at 08:52:23AM +0800, Ming Lei wrote:
> On Mon, Feb 13, 2023 at 12:04:27PM -0800, Linus Torvalds wrote:
> > On Sat, Feb 11, 2023 at 5:39 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > >
> > > >  (a) what's the point of MAY_READ? A non-readable page sounds insane
> > > > and wrong. All sinks expect to be able to read.
> > >
> > > For example, it is one page which needs sink end to fill data, so
> > > we needn't to zero it in source end every time, just for avoiding
> > > leak kernel data if (unexpected)sink end simply tried to read from
> > > the spliced page instead of writing data to page.
> > 
> > I still don't understand.
> > 
> > A sink *reads* the data. It doesn't write the data.
> > 
> > There's no point trying to deal with "if unexpectedly doing crazy
> > things". If a sink writes the data, the sinkm is so unbelievably buggy
> > that it's not even funny.
> > 
> > And having two flags that you then say "have to be used together" is pointless.
> 
> Actually I think it is fine to use the pipe buffer flags separately,
> if MAY_READ/MAY_WRITE is set in source end, the sink side need to respect
> it. All current in-tree source end actually implies both MAY_READ & MAY_WRITE.
> 
> > It's not two different flags if you can't use them separately!
> > 
> > So I think your explanations are anything *but* explaining what you
> > want. They are just strange and not sensible.
> > 
> > Please explain to me in small words and simple sentences what it is
> > you want. And no, if the explanation is "the sink wants to write to
> > the buffer", then that's not an explanation, it's just insanity.
> > 
> > We *used* to have the concept of "gifting" the buffer explicitly to
> > the sink, so that the sink could - instead of reading from it - decide
> > to just use the whole buffer as-is long term. The idea was that tthe
> > buffer woudl literally be moved from the source to the destination,
> > ownership and all.
> > 
> > But if that's what you want, then it's not about "sink writes". It's
> > literally about the splice() wanting to move not just the data, but
> > the whole ownership of the buffer.
> 
> Yeah, it is actually transferring the buffer ownership, and looks
> SPLICE_F_GIFT is exactly the case, but the driver side needs to set
> QUEUE_FLAG_STABLE_WRITES for avoiding writeback to touch these pages.
> 
> Follows the idea:
> 
> file(devices(such as, fuse, ublk), produce pipe buffer) -> direct pipe -> file(consume the pipe buffer)
> 
> The 'consume' could be READ or WRITE.
> 
> So once SPLICE_F_GIFT is set from source side, the two buffer flags
> aren't needed any more, right?

Sorry, I meant PIPE_BUF_FLAG_GIFT actually.

> 
> Please see the detailed explanation & use case in following link:
> 
> https://lore.kernel.org/linux-block/409656a0-7db5-d87c-3bb2-c05ff7af89af@kernel.dk/T/#m237e5973571b3d85df9fa519cf2c9762440009ba
> 

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-13 20:04         ` Linus Torvalds
  2023-02-14  0:52           ` Ming Lei
@ 2023-02-14 11:03           ` Miklos Szeredi
  2023-02-14 14:35             ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Miklos Szeredi @ 2023-02-14 11:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ming Lei, Jens Axboe, io-uring, linux-block, linux-fsdevel,
	Alexander Viro, Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert,
	Nitesh Shetty, Christoph Hellwig, Ziyang Zhang

On Mon, 13 Feb 2023 at 21:04, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Feb 11, 2023 at 5:39 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > >
> > >  (a) what's the point of MAY_READ? A non-readable page sounds insane
> > > and wrong. All sinks expect to be able to read.
> >
> > For example, it is one page which needs sink end to fill data, so
> > we needn't to zero it in source end every time, just for avoiding
> > leak kernel data if (unexpected)sink end simply tried to read from
> > the spliced page instead of writing data to page.
>
> I still don't understand.
>
> A sink *reads* the data. It doesn't write the data.

I think Ming is trying to generalize splice to allow flowing data in
the opposite direction.  So yes, sink would be writing to the buffer.
And it MUST NOT be reading the data since the buffer may be
uninitialized.

The problem is how to tell the original source that the buffer is
ready?  PG_uptodate comes to mind, but pipe buffers allow partial
pages to be passed around, and there's no mechanism to describe a
partially uptodate buffer.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-14 11:03           ` Miklos Szeredi
@ 2023-02-14 14:35             ` Ming Lei
  2023-02-14 15:39               ` Miklos Szeredi
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-14 14:35 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Jens Axboe, io-uring, linux-block, linux-fsdevel,
	Alexander Viro, Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert,
	Nitesh Shetty, Christoph Hellwig, Ziyang Zhang, ming.lei

Hi Miklos,

On Tue, Feb 14, 2023 at 12:03:44PM +0100, Miklos Szeredi wrote:
> On Mon, 13 Feb 2023 at 21:04, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Sat, Feb 11, 2023 at 5:39 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > >
> > > >  (a) what's the point of MAY_READ? A non-readable page sounds insane
> > > > and wrong. All sinks expect to be able to read.
> > >
> > > For example, it is one page which needs sink end to fill data, so
> > > we needn't to zero it in source end every time, just for avoiding
> > > leak kernel data if (unexpected)sink end simply tried to read from
> > > the spliced page instead of writing data to page.
> >
> > I still don't understand.
> >
> > A sink *reads* the data. It doesn't write the data.
> 
> I think Ming is trying to generalize splice to allow flowing data in
> the opposite direction.

I think it isn't the opposite direction; it is just that the sink may
WRITE to the buffer, and the model is:

device (produce buffer in ->splice_read()) -> direct pipe ->
	file (consume buffer via READ or WRITE)

The kernel-side implementation looks like:

	splice_direct_to_actor(pipe, sd, source_actor)

	direct_actor():
		__splice_from_pipe(pipe, sd, sink_actor)

	sink_actor():
		get_page()

then read from the file/socket into the page.

Currently userspace owns the whole buffer, so I understand the buffer
ownership can be transferred to the consumer/sink side.

> So yes, sink would be writing to the buffer.
> And it MUST NOT be reading the data since the buffer may be
> uninitialized.

The added SPLICE_F_PRIV_FOR_READ[WRITE] is enough to avoid an
unexpected READ, but the source end needs to confirm that the buffer
ownership can be transferred to the consumer; probably PIPE_BUF_FLAG_GIFT
can be used for this purpose.

> 
> The problem is how to tell the original source that the buffer is
> ready?  PG_uptodate comes to mind, but pipe buffers allow partial
> pages to be passed around, and there's no mechanism to describe a
> partially uptodate buffer.

I understand it isn't an issue from the block device driver viewpoint at
least, since the WRITE to the buffer at the sink end can be thought of as
DMA to the buffer from the device, and it is the upper layer's (the FS's)
responsibility to update the page flags. The block driver needn't handle
the page status update.

So it seems it is a fuse-specific issue?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-14 14:35             ` Ming Lei
@ 2023-02-14 15:39               ` Miklos Szeredi
  2023-02-15  0:11                 ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Miklos Szeredi @ 2023-02-14 15:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Linus Torvalds, Jens Axboe, io-uring, linux-block, linux-fsdevel,
	Alexander Viro, Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert,
	Nitesh Shetty, Christoph Hellwig, Ziyang Zhang

On Tue, 14 Feb 2023 at 15:35, Ming Lei <ming.lei@redhat.com> wrote:

> I understand it isn't one issue from block device driver viewpoint at
> least, since the WRITE to buffer in sink end can be thought as DMA
> to buffer from device, and it is the upper layer(FS)'s responsibility
> for updating page flag. And block driver needn't to handle page
> status update.

The block device still needs to know when the DMA is finished, right?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
  2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
                   ` (4 preceding siblings ...)
  2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
@ 2023-02-14 16:36 ` Stefan Hajnoczi
  5 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2023-02-14 16:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, linux-block, linux-fsdevel, Alexander Viro,
	Miklos Szeredi, Bernd Schubert, Nitesh Shetty, Christoph Hellwig,
	Ziyang Zhang


On Fri, Feb 10, 2023 at 11:32:08PM +0800, Ming Lei wrote:
> Hello,
> 
> Add two OPs which buffer is retrieved via kernel splice for supporting
> fuse/ublk zero copy.
> 
> The 1st patch enhances direct pipe & splice for moving pages in kernel,
> so that the two added OPs won't be misused, and avoid potential security
> hole.
> 
> The 2nd patch allows splice_direct_to_actor() caller to ignore signal
> if the actor won't block and can be done in bound time.
> 
> The 3rd patch add the two OPs.
> 
> The 4th patch implements ublk's ->splice_read() for supporting
> zero copy.
> 
> ublksrv(userspace):
> 
> https://github.com/ming1/ubdsrv/commits/io_uring_splice_buf
>     
> So far, only loop/null target implements zero copy in above branch:
>     
> 	ublk add -t loop -f $file -z
> 	ublk add -t none -z
> 
> Basic FS/IO function is verified, mount/kernel building & fio
> works fine, and big chunk IO(BS: 64k/512k) performance gets improved
> obviously.
>  
> Any comment is welcome!

I'm not familiar enough with the splice implementation to review these
patches, but the performance numbers you posted look great. This could
be very nice for ublk and FUSE servers!

Thanks,
Stefan

> Ming Lei (4):
>   fs/splice: enhance direct pipe & splice for moving pages in kernel
>   fs/splice: allow to ignore signal in __splice_from_pipe
>   io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
>   ublk_drv: support splice based read/write zero copy
> 
>  drivers/block/ublk_drv.c      | 169 +++++++++++++++++++++++++++++++--
>  fs/splice.c                   |  19 +++-
>  include/linux/pipe_fs_i.h     |  10 ++
>  include/linux/splice.h        |  23 +++++
>  include/uapi/linux/io_uring.h |   2 +
>  include/uapi/linux/ublk_cmd.h |  31 +++++-
>  io_uring/opdef.c              |  37 ++++++++
>  io_uring/rw.c                 | 174 +++++++++++++++++++++++++++++++++-
>  io_uring/rw.h                 |   1 +
>  9 files changed, 456 insertions(+), 10 deletions(-)
> 
> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-14 15:39               ` Miklos Szeredi
@ 2023-02-15  0:11                 ` Ming Lei
  2023-02-15 10:36                   ` Miklos Szeredi
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2023-02-15  0:11 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Jens Axboe, io-uring, linux-block, linux-fsdevel,
	Alexander Viro, Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert,
	Nitesh Shetty, Christoph Hellwig, Ziyang Zhang, ming.lei

On Tue, Feb 14, 2023 at 04:39:01PM +0100, Miklos Szeredi wrote:
> On Tue, 14 Feb 2023 at 15:35, Ming Lei <ming.lei@redhat.com> wrote:
> 
> > I understand it isn't one issue from block device driver viewpoint at
> > least, since the WRITE to buffer in sink end can be thought as DMA
> > to buffer from device, and it is the upper layer(FS)'s responsibility
> > for updating page flag. And block driver needn't to handle page
> > status update.
> 
> The block device still needs to know when the DMA is finished, right?

Yeah, for a normal in-kernel device driver, the completion is triggered
by an interrupt or io polling.

For ublk, io handling is done by userspace; here we use io_uring to
handle the IO in aio style. When the aio is completed, userspace
gets notified of the completion.

Here the approach is basically the same as loop's dio mode
(losetup --direct-io=on); it is still zero copy, and pages from the loop
block driver are passed to the FS (backing file) directly for handling
the original IO.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
  2023-02-15  0:11                 ` Ming Lei
@ 2023-02-15 10:36                   ` Miklos Szeredi
  0 siblings, 0 replies; 30+ messages in thread
From: Miklos Szeredi @ 2023-02-15 10:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Linus Torvalds, Jens Axboe, io-uring, linux-block, linux-fsdevel,
	Alexander Viro, Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert,
	Nitesh Shetty, Christoph Hellwig, Ziyang Zhang

On Wed, 15 Feb 2023 at 01:11, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Feb 14, 2023 at 04:39:01PM +0100, Miklos Szeredi wrote:
> > On Tue, 14 Feb 2023 at 15:35, Ming Lei <ming.lei@redhat.com> wrote:
> >
> > > I understand it isn't one issue from block device driver viewpoint at
> > > least, since the WRITE to buffer in sink end can be thought as DMA
> > > to buffer from device, and it is the upper layer(FS)'s responsibility
> > > for updating page flag. And block driver needn't to handle page
> > > status update.
> >
> > The block device still needs to know when the DMA is finished, right?
>
> Yeah, for normal in-kernel device driver, the completion is triggered by
> interrupt or io polling.
>
> For ublk, io handling is done by userspace, here we use io_uring to
> handle the IO in aio style. When the aio is completed, the userspace
> gets notified of the completion.

I think it might be better if the write completion was directly
signaled to the original pipe buffer.  There are several advantages:

 - the kernel can guarantee (modulo bugs) that the buffer was
initialized; this is important if the userspace server is unprivileged

 - the server process does not need to be woken up on I/O completion

 - there could be a chain of splices involving various entities, yet
the original pipe buffer should always get the completion

I'm not sure what a good implementation would look like.  When a pipe
buffer is split, things get complicated.  Maybe just disallow
splitting on such buffers...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-02-15 10:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-10 15:32 [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-10 15:32 ` [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel Ming Lei
2023-02-11 15:42   ` Ming Lei
2023-02-11 18:57     ` Linus Torvalds
2023-02-12  1:39       ` Ming Lei
2023-02-13 20:04         ` Linus Torvalds
2023-02-14  0:52           ` Ming Lei
2023-02-14  2:35             ` Ming Lei
2023-02-14 11:03           ` Miklos Szeredi
2023-02-14 14:35             ` Ming Lei
2023-02-14 15:39               ` Miklos Szeredi
2023-02-15  0:11                 ` Ming Lei
2023-02-15 10:36                   ` Miklos Szeredi
2023-02-10 15:32 ` [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe Ming Lei
2023-02-10 15:32 ` [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Ming Lei
2023-02-11 15:45   ` Jens Axboe
2023-02-11 16:12     ` Ming Lei
2023-02-11 16:52       ` Jens Axboe
2023-02-12  3:22         ` Ming Lei
2023-02-12  3:55           ` Jens Axboe
2023-02-13  1:06             ` Ming Lei
2023-02-11 17:13   ` Jens Axboe
2023-02-12  1:48     ` Ming Lei
2023-02-12  2:42       ` Jens Axboe
2023-02-10 15:32 ` [PATCH 4/4] ublk_drv: support splice based read/write zero copy Ming Lei
2023-02-10 21:54 ` [PATCH 0/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF Jens Axboe
2023-02-10 22:19   ` Jens Axboe
2023-02-11  5:13   ` Ming Lei
2023-02-11 15:45     ` Jens Axboe
2023-02-14 16:36 ` Stefan Hajnoczi
