* [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk
@ 2023-02-15  0:41 Xiaoguang Wang
  2023-02-15  0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-15  0:41 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang

Normally, userspace block device implementations need to copy data between
the kernel block layer's io requests and the userspace block device's
userspace daemon; for example, ublk and tcmu both have similar logic. This
copy noticeably consumes cpu resources, especially for large io.

There are methods that try to reduce these cpu overheads, so that the
userspace block device's io performance can be improved further. They
include: 1) using special hardware to do the memory copy, but it seems not
all architectures have such hardware; 2) software methods, such as
mmap()ing the kernel block layer's io request data into the userspace
daemon [1], but that has page table map/unmap and tlb flush overheads,
security issues, etc, and may only be friendly to large io.

To solve this problem, I'd propose a new method which combines the
respective advantages of io_uring and ebpf. Add a new program type
BPF_PROG_TYPE_UBLK for ublk; the userspace block device daemon process
registers an ebpf prog, and this bpf prog uses the bpf helper offered by
the ublk bpf prog type to submit io requests on behalf of the daemon
process. Currently there is only one helper:
    u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
		struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)

This helper uses io_uring to submit io requests, so we need to make
io_uring able to submit an sqe located in the kernel (some of the code
idea comes from Pavel's patchset [2], but Pavel's patches still require
sqe->buf to come from a userspace address). The bpf prog initializes sqes,
but does not need to initialize the sqes' buf field; sqe->buf will come
from kernel block layer io requests in some form. See patch 2 for more.

In the example of the ublk loop target, we can easily implement the below
logic in an ebpf prog (see the sketch below):
  1. The userspace daemon registers an ebpf prog and passes two backend
file fds in an ebpf map structure.
  2. For kernel io requests against the first half of the userspace device,
the ebpf prog prepares an io_uring sqe, which will submit io against the
first backend file fd, and the sqe's buffer comes from the kernel io
request. Kernel io requests against the second half of the userspace device
have similar logic; only the sqe's fd will be the second backend file fd.
  3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is
executed and completes the kernel io requests.

That means, by using ebpf, we can implement various kinds of userspace
logic in the kernel.
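
As a concrete illustration, below is a hedged sketch of such an ebpf prog.
It is not part of this patchset: DEV_SECTORS, RING_FD and backend_fd_map
are assumed, illustrative names; the sketch relies on the ublk_bpf_ctx
layout and the bpf_ublk_queue_sqe() helper introduced by these patches:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* illustrative assumptions, not part of this patchset */
  #define DEV_SECTORS (1ULL << 21)  /* 1GiB device */
  #define RING_FD     10            /* daemon's io_uring fd */

  static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
          u32 sqe_len, u32 fd) = (void *) 212;

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 2);
          __type(key, int);
          __type(value, int);     /* the two backend file fds */
  } backend_fd_map SEC(".maps");

  SEC("ublk.s/")
  int ublk_split_loop_prog(struct ublk_bpf_ctx *ctx)
  {
          struct io_uring_sqe sqe = {};
          int key, *fd;

          /* first half of the device -> fd 0, second half -> fd 1 */
          key = (ctx->start_sector < DEV_SECTORS / 2) ? 0 : 1;
          fd = bpf_map_lookup_elem(&backend_fd_map, &key);
          if (!fd)
                  return 0;

          sqe.opcode = (ctx->op == REQ_OP_READ) ? IORING_OP_READ :
                       IORING_OP_WRITE;
          sqe.fd = *fd;
          sqe.off = (__u64)(ctx->start_sector % (DEV_SECTORS / 2)) << 9;
          sqe.len = ctx->nr_sectors << 9;
          sqe.user_data = ctx->tag;
          /* sqe.addr stays 0: the buffer comes from the kernel block
           * layer request's bvecs (see patch 2) */
          bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), RING_FD);
          return 0;
  }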

From the above example, we can see that this method has at least 3
advantages:
  1. Memory copies between the kernel block layer and the userspace daemon
are removed completely.
  2. Memory is saved. The userspace daemon doesn't need to maintain its own
buffers to issue and complete io requests; the kernel block layer io
requests' memory is used directly.
  3. We may reduce the number of round trips between the kernel and the
userspace daemon, and thus reduce kernel & userspace context switch
overheads.

Test:
Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file

fio job file:
  [global]
  direct=1
  filename=/dev/ublkb0
  time_based
  runtime=60
  numjobs=1
  cpus_allowed=1
  
  [rand-write-512k]
  bs=512K
  iodepth=16
  ioengine=libaio
  rw=randwrite
  stonewall


Without this patch:
  WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
  The ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by top.

With this patch:
  WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
  The ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by top.

From the above tests, this method reduces cpu copy overhead noticeably.


TODO:
I must say this patchset is just a RFC for design.

1) Currently in this patchset, I just make the ublk ebpf prog submit io
requests using io_uring in the kernel; cqe events still need to be handled
in the userspace daemon. Once we later succeed in making io_uring handle
cqes in the kernel, the ublk ebpf prog can implement io entirely in the
kernel.

2) The ublk driver needs to work better with ebpf. Currently I did some
hack code to support ebpf in the ublk driver, and it can only support
write requests.

3) I have not done many tests yet, and will run liburing/ublk/blktests
later.

Any review and suggestions are welcome, thanks.

[1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/cover.1621424513.git.asml.silence@gmail.com/


Xiaoguang Wang (3):
  bpf: add UBLK program type
  io_uring: enable io_uring to submit sqes located in kernel
  ublk_drv: add ebpf support

 drivers/block/ublk_drv.c       | 228 ++++++++++++++++++++++++++++++++-
 include/linux/bpf_types.h      |   2 +
 include/linux/io_uring.h       |  13 ++
 include/linux/io_uring_types.h |   8 +-
 include/uapi/linux/bpf.h       |   2 +
 include/uapi/linux/ublk_cmd.h  |  11 ++
 io_uring/io_uring.c            |  59 ++++++++-
 io_uring/rsrc.c                |  15 +++
 io_uring/rsrc.h                |   3 +
 io_uring/rw.c                  |   7 +
 kernel/bpf/syscall.c           |   1 +
 kernel/bpf/verifier.c          |   9 +-
 scripts/bpf_doc.py             |   4 +
 tools/include/uapi/linux/bpf.h |   9 ++
 tools/lib/bpf/libbpf.c         |   2 +
 15 files changed, 366 insertions(+), 7 deletions(-)

-- 
2.31.1



* [RFC 1/3] bpf: add UBLK program type
  2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
@ 2023-02-15  0:41 ` Xiaoguang Wang
  2023-02-15 15:52   ` kernel test robot
  2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-15  0:41 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang

Add a new program type BPF_PROG_TYPE_UBLK for ublk, which is a generic
framework for implementing block device logic from userspace. Typical
userspace block device implementations need to copy data between the
kernel block layer's io requests and the userspace block device's
userspace daemon, which consumes cpu resources, especially for large io.

There are methods that try to reduce these cpu overheads, so that the
userspace block device's io performance can be improved further. They
include: 1) using special hardware to do the memory copy, but it seems not
all architectures have such hardware; 2) software methods, such as
mmap()ing the kernel block layer's io request data into the userspace
daemon [1], but that has page table map/unmap and tlb flush overheads,
etc, and may only be friendly to large io.

To solve this problem, I'd propose a new method which uses io_uring to
submit the userspace daemon's io requests in the kernel and uses the
kernel block layer io requests' pages. Further, userspace block devices
may have different userspace logic about how to complete kernel io
requests; here we can use ebpf to implement various kinds of userspace
logic in the kernel. In the example of the ublk loop target, we can easily
implement the below logic in an ebpf prog:
  1. The userspace daemon registers this ebpf prog and passes two backend
file fds in an ebpf map structure.
  2. For kernel io requests against the first half of the userspace device,
the ebpf prog prepares an io_uring sqe, which will submit io against the
first backend file fd, and the sqe's buffer comes from the kernel io
request. Kernel io requests against the second half of the userspace device
have similar logic; only the sqe's fd will be the second backend file fd.
  3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is
executed and completes the kernel io requests.

From the above example, we can see that this method has at least 3
advantages:
  1. Memory copies between the kernel block layer and the userspace daemon
are removed completely.
  2. Memory is saved. The userspace daemon doesn't need to maintain its own
buffers to issue and complete io requests; the kernel block layer io
requests' memory is used directly.
  3. We may reduce the number of round trips between the kernel and the
userspace daemon.

Currently in this patchset, I just make the ublk ebpf prog submit io
requests using io_uring in the kernel; cqe events still need to be handled
in the userspace daemon. Once we later succeed in making io_uring handle
cqes in the kernel, the ublk ebpf prog can implement io entirely in the
kernel.
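
For reference, with the libbpf SEC_DEF entries added in this patch
("ublk/" and "ublk.s/"), a daemon can load such a prog with stock libbpf.
Below is a minimal sketch, assuming an object file and prog name like the
ones used in the ublksrv patch later in this thread:

  #include <bpf/libbpf.h>

  static int load_ublk_prog(void)
  {
          struct bpf_object *obj;
          struct bpf_program *prog;

          obj = bpf_object__open_file("ublk.bpf.o", NULL);
          if (!obj)
                  return -1;
          /* BPF_PROG_TYPE_UBLK progs are verified during load */
          if (bpf_object__load(obj))
                  return -1;
          prog = bpf_object__find_program_by_name(obj, "ublk_io_prep_prog");
          if (!prog)
                  return -1;
          /* the returned fd is later handed to the ublk driver (patch 3) */
          return bpf_program__fd(prog);
  }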

[1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com/

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c       | 23 +++++++++++++++++++++++
 include/linux/bpf_types.h      |  2 ++
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  9 +++++++--
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  2 ++
 7 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6368b56eacf1..b628e9eaefa6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,8 @@
 #include <asm/page.h>
 #include <linux/task_work.h>
 #include <uapi/linux/ublk_cmd.h>
+#include <linux/filter.h>
+#include <linux/bpf.h>
 
 #define UBLK_MINORS		(1U << MINORBITS)
 
@@ -187,6 +189,27 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
 
 static struct miscdevice ublk_misc;
 
+static const struct bpf_func_proto *
+ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static bool ublk_bpf_is_valid_access(int off, int size,
+			enum bpf_access_type type,
+			const struct bpf_prog *prog,
+			struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+const struct bpf_prog_ops bpf_ublk_prog_ops = {};
+
+const struct bpf_verifier_ops bpf_ublk_verifier_ops = {
+	.get_func_proto		= ublk_bpf_func_proto,
+	.is_valid_access	= ublk_bpf_is_valid_access,
+};
+
 static void ublk_dev_param_basic_apply(struct ublk_device *ub)
 {
 	struct request_queue *q = ub->ub_disk->queue;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d4ee3ccd3753..4ef0bc0251b7 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -79,6 +79,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
 #endif
 BPF_PROG_TYPE(BPF_PROG_TYPE_SYSCALL, bpf_syscall,
 	      void *, void *)
+BPF_PROG_TYPE(BPF_PROG_TYPE_UBLK, bpf_ublk,
+	      void *, void *)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 464ca3f01fe7..515b7b995b3a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -986,6 +986,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_UBLK,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ecca9366c7a6..eb1752243f4f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2432,6 +2432,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_SOCK_OPS:
+	case BPF_PROG_TYPE_UBLK:
 	case BPF_PROG_TYPE_EXT: /* extends any prog */
 		return true;
 	case BPF_PROG_TYPE_CGROUP_SKB:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ee218827259..1e5bc89aea36 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12235,6 +12235,10 @@ static int check_return_code(struct bpf_verifier_env *env)
 		}
 		break;
 
+	case BPF_PROG_TYPE_UBLK:
+		range = tnum_const(0);
+		break;
+
 	case BPF_PROG_TYPE_EXT:
 		/* freplace program can return anything as its return value
 		 * depends on the to-be-replaced kernel func or bpf program.
@@ -16770,8 +16774,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
 	}
 
 	if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
-	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE) {
-		verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe programs can be sleepable\n");
+	    prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_KPROBE &&
+	    prog->type != BPF_PROG_TYPE_UBLK) {
+		verbose(env, "Only fentry/fexit/fmod_ret, lsm, and kprobe/uprobe, ublk programs can be sleepable\n");
 		return -EINVAL;
 	}
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 464ca3f01fe7..515b7b995b3a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -986,6 +986,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_UBLK,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2a82f49ce16f..6fe77f9a2cc8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8606,6 +8606,8 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("cgroup/dev",		CGROUP_DEVICE, BPF_CGROUP_DEVICE, SEC_ATTACHABLE_OPT),
 	SEC_DEF("struct_ops+",		STRUCT_OPS, 0, SEC_NONE),
 	SEC_DEF("sk_lookup",		SK_LOOKUP, BPF_SK_LOOKUP, SEC_ATTACHABLE),
+	SEC_DEF("ublk/",		UBLK, 0, SEC_SLEEPABLE),
+	SEC_DEF("ublk.s/",		UBLK, 0, SEC_SLEEPABLE),
 };
 
 static size_t custom_sec_def_cnt;
-- 
2.31.1



* [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
  2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
  2023-02-15  0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
@ 2023-02-15  0:41 ` Xiaoguang Wang
  2023-02-15 14:17   ` kernel test robot
                     ` (2 more replies)
  2023-02-15  0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-15  0:41 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang

Currently this feature can be used by userspace block devices to reduce
kernel & userspace memory copy overhead. With this feature, a userspace
block device driver can submit and complete io requests using the kernel
block layer io requests' memory data, and further, by using ebpf, we can
customize how an sqe is initialized and how io is submitted and completed.
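
To illustrate the intended calling convention (a hedged sketch of a
possible caller, not code from this patch; submit_rq_via_uring() is an
illustrative name): a driver snapshots a request's bvecs into an
io_mapped_kbuf and passes it, together with a kernel-resident sqe, to the
new io_uring_submit_sqe():

  #include <linux/blk-mq.h>
  #include <linux/io_uring.h>
  #include <linux/slab.h>

  static int submit_rq_via_uring(struct request *rq,
                                 const struct io_uring_sqe *sqe, int ring_fd)
  {
          struct io_mapped_kbuf *kbuf;
          struct req_iterator iter;
          struct bio_vec bv, *bvec;
          int nr = 0;

          /* count the request's bvecs, then snapshot them */
          rq_for_each_bvec(bv, rq, iter)
                  nr++;

          kbuf = kmalloc(sizeof(*kbuf), GFP_NOIO);
          bvec = kmalloc_array(nr, sizeof(*bvec), GFP_NOIO);
          if (!kbuf || !bvec) {
                  kfree(kbuf);
                  kfree(bvec);
                  return -ENOMEM;
          }
          kbuf->bvec = bvec;
          rq_for_each_bvec(bv, rq, iter)
                  *bvec++ = bv;
          kbuf->nr_bvecs = nr;
          kbuf->count = blk_rq_bytes(rq);

          /* REQ_F_KBUF is set internally; __io_import_iovec() then
           * builds an ITER_BVEC iterator over kbuf instead of reading
           * a user address from the sqe */
          return io_uring_submit_sqe(ring_fd, sqe, sizeof(*sqe), kbuf);
  }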

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 include/linux/io_uring.h       | 13 ++++++++
 include/linux/io_uring_types.h |  8 ++++-
 io_uring/io_uring.c            | 59 ++++++++++++++++++++++++++++++++--
 io_uring/rsrc.c                | 15 +++++++++
 io_uring/rsrc.h                |  3 ++
 io_uring/rw.c                  |  7 ++++
 6 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..d69882c98608 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -36,6 +36,12 @@ struct io_uring_cmd {
 	u8		pdu[32]; /* available inline for free use */
 };
 
+struct io_mapped_kbuf {
+	size_t count;
+	unsigned int nr_bvecs;
+	struct bio_vec *bvec;
+};
+
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
@@ -65,6 +71,8 @@ static inline void io_uring_free(struct task_struct *tsk)
 	if (tsk->io_uring)
 		__io_uring_free(tsk);
 }
+int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
+			struct io_mapped_kbuf *kbuf);
 #else
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
@@ -96,6 +104,11 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
 	return "";
 }
+static inline int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe,
+			u32 sqe_len, struct io_mapped_kbuf *kbuf)
+{
+	return 0;
+}
 #endif
 
 #endif
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a40065..260f8365c802 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -398,6 +398,7 @@ enum {
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_KBUF_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -467,6 +468,8 @@ enum {
 	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
 	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
+	/* buffer comes from kernel */
+	REQ_F_KBUF		= BIT(REQ_F_KBUF_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, bool *locked);
@@ -527,7 +530,7 @@ struct io_kiocb {
 	 * and after selection it points to the buffer ID itself.
 	 */
 	u16				buf_index;
-	unsigned int			flags;
+	u64				flags;
 
 	struct io_cqe			cqe;
 
@@ -540,6 +543,9 @@ struct io_kiocb {
 		/* store used ubuf, so we can prevent reloading */
 		struct io_mapped_ubuf	*imu;
 
+		/* store used kbuf */
+		struct io_mapped_kbuf	*imk;
+
 		/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
 		struct io_buffer	*kbuf;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db623b3185c8..a174365470fb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2232,7 +2232,8 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 }
 
 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 const struct io_uring_sqe *sqe)
+			 const struct io_uring_sqe *sqe,
+			 struct io_mapped_kbuf *kbuf)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_link *link = &ctx->submit_state.link;
@@ -2241,6 +2242,10 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	ret = io_init_req(ctx, req, sqe);
 	if (unlikely(ret))
 		return io_submit_fail_init(sqe, req, ret);
+	if (unlikely(kbuf)) {
+		req->imk = kbuf;
+		req->flags |= REQ_F_KBUF;
+	}
 
 	/* don't need @sqe from now on */
 	trace_io_uring_submit_sqe(req, true);
@@ -2392,7 +2397,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 * Continue submitting even for sqe failure if the
 		 * ring was setup with IORING_SETUP_SUBMIT_ALL
 		 */
-		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+		if (unlikely(io_submit_sqe(ctx, req, sqe, NULL)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
 			left--;
 			break;
@@ -3272,6 +3277,54 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
 	return 0;
 }
 
+int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
+			struct io_mapped_kbuf *kbuf)
+{
+	struct io_kiocb *req;
+	struct fd f;
+	int ret;
+	struct io_ring_ctx *ctx;
+
+	f = fdget(fd);
+	if (unlikely(!f.file))
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (unlikely(!io_is_uring_fops(f.file))) {
+		ret = -EBADF;
+		goto out;
+	}
+	ctx = f.file->private_data;
+
+	mutex_lock(&ctx->uring_lock);
+	if (unlikely(!io_alloc_req_refill(ctx)))
+		goto out;
+	req = io_alloc_req(ctx);
+	if (unlikely(!req)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
+		kmem_cache_free(req_cachep, req);
+		ret = -EAGAIN;
+		goto out;
+	}
+	percpu_counter_add(&current->io_uring->inflight, 1);
+	refcount_add(1, &current->usage);
+
+	/* returns number of submitted SQEs or an error */
+	ret = !io_submit_sqe(ctx, req, sqe, kbuf);
+	mutex_unlock(&ctx->uring_lock);
+	fdput(f);
+	return ret;
+
+out:
+	mutex_unlock(&ctx->uring_lock);
+	fdput(f);
+	return ret;
+}
+EXPORT_SYMBOL(io_uring_submit_sqe);
+
 SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		u32, min_complete, u32, flags, const void __user *, argp,
 		size_t, argsz)
@@ -4270,7 +4323,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
 	BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
 
-	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
 
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18de10c68a15..51861f01185f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1380,3 +1380,18 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 
 	return 0;
 }
+
+int io_import_fixed_kbuf(int ddir, struct iov_iter *iter,
+			 struct io_mapped_kbuf *kbuf,
+			u64 offset, size_t len)
+{
+	if (WARN_ON_ONCE(!kbuf))
+		return -EFAULT;
+	if (offset >= kbuf->count)
+		return -EFAULT;
+
+	iov_iter_bvec(iter, ddir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+	iov_iter_advance(iter, offset);
+	return 0;
+}
+
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b8743645efc..c6897d218bb9 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -69,6 +69,9 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 			   struct io_mapped_ubuf *imu,
 			   u64 buf_addr, size_t len);
 
+int io_import_fixed_kbuf(int ddir, struct iov_iter *iter,
+		struct io_mapped_kbuf *kbuf, u64 buf_addr, size_t len);
+
 void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
 int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9c3ddd46a1ad..bdf4c4f0661f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -378,6 +378,13 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 		return NULL;
 	}
 
+	if (unlikely(req->flags & REQ_F_KBUF)) {
+		ret = io_import_fixed_kbuf(ddir, iter, req->imk, rw->addr, rw->len);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
 	buf = u64_to_user_ptr(rw->addr);
 	sqe_len = rw->len;
 
-- 
2.31.1



* [RFC 3/3] ublk_drv: add ebpf support
  2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
  2023-02-15  0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
  2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
@ 2023-02-15  0:41 ` Xiaoguang Wang
  2023-02-16  8:11   ` Ming Lei
  2023-02-15  0:46 ` [UBLKSRV] Add " Xiaoguang Wang
  2023-02-15  8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang
  4 siblings, 1 reply; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-15  0:41 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang

Currently only one ebpf helper, bpf_ublk_queue_sqe(), is added. A ublksrv
target can use this helper to write an ebpf prog to support ublk kernel &
userspace zero copy; please see the ublksrv test code for more info.
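
For context, the userspace registration flow (condensed from the ublksrv
patch later in this thread; reg_ublk_progs() is an illustrative wrapper)
boils down to loading the two progs and handing their fds to the driver
via the new control command:

  /* condensed sketch; error handling trimmed */
  static int reg_ublk_progs(struct ublksrv_ctrl_dev *cdev)
  {
          struct ublk_bpf *obj = ublk_bpf__open();   /* libbpf skeleton */
          int io_prep_fd, io_submit_fd;

          if (!obj || ublk_bpf__load(obj))
                  return -1;
          io_prep_fd = bpf_program__fd(obj->progs.ublk_io_prep_prog);
          io_submit_fd = bpf_program__fd(obj->progs.ublk_io_submit_prog);

          /* UBLK_CMD_REG_BPF_PROG: the driver reads the two fds from
           * header->data[0]/data[1] and resolves them with
           * bpf_prog_get_type(..., BPF_PROG_TYPE_UBLK) */
          return ublksrv_ctrl_reg_bpf_prog(cdev, io_prep_fd, io_submit_fd);
  }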

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 drivers/block/ublk_drv.c       | 207 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/bpf.h       |   1 +
 include/uapi/linux/ublk_cmd.h  |  11 ++
 scripts/bpf_doc.py             |   4 +
 tools/include/uapi/linux/bpf.h |   8 ++
 5 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b628e9eaefa6..44c289b72864 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -61,6 +61,7 @@
 struct ublk_rq_data {
 	struct llist_node node;
 	struct callback_head work;
+	struct io_mapped_kbuf *kbuf;
 };
 
 struct ublk_uring_cmd_pdu {
@@ -163,6 +164,9 @@ struct ublk_device {
 	unsigned int		nr_queues_ready;
 	atomic_t		nr_aborted_queues;
 
+	struct bpf_prog		*io_prep_prog;
+	struct bpf_prog		*io_submit_prog;
+
 	/*
 	 * Our ubq->daemon may be killed without any notification, so
 	 * monitor each queue's daemon periodically
@@ -189,10 +193,46 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
 
 static struct miscdevice ublk_misc;
 
+struct ublk_io_bpf_ctx {
+	struct ublk_bpf_ctx ctx;
+	struct ublk_device *ub;
+	struct callback_head work;
+};
+
+BPF_CALL_4(bpf_ublk_queue_sqe, struct ublk_io_bpf_ctx *, bpf_ctx,
+	   struct io_uring_sqe *, sqe, u32, sqe_len, u32, fd)
+{
+	struct request *rq;
+	struct ublk_rq_data *data;
+	struct io_mapped_kbuf *kbuf;
+	u16 q_id = bpf_ctx->ctx.q_id;
+	u16 tag = bpf_ctx->ctx.tag;
+
+	rq = blk_mq_tag_to_rq(bpf_ctx->ub->tag_set.tags[q_id], tag);
+	data = blk_mq_rq_to_pdu(rq);
+	kbuf = data->kbuf;
+	io_uring_submit_sqe(fd, sqe, sqe_len, kbuf);
+	return 0;
+}
+
+const struct bpf_func_proto ublk_bpf_queue_sqe_proto = {
+	.func = bpf_ublk_queue_sqe,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_ANYTHING,
+	.arg2_type = ARG_ANYTHING,
+	.arg3_type = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	switch (func_id) {
+	case BPF_FUNC_ublk_queue_sqe:
+		return &ublk_bpf_queue_sqe_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
 }
 
 static bool ublk_bpf_is_valid_access(int off, int size,
@@ -200,6 +240,23 @@ static bool ublk_bpf_is_valid_access(int off, int size,
 			const struct bpf_prog *prog,
 			struct bpf_insn_access_aux *info)
 {
+	if (off < 0 || off >= sizeof(struct ublk_bpf_ctx))
+		return false;
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct ublk_bpf_ctx, q_id):
+		return size == sizeof_field(struct ublk_bpf_ctx, q_id);
+	case offsetof(struct ublk_bpf_ctx, tag):
+		return size == sizeof_field(struct ublk_bpf_ctx, tag);
+	case offsetof(struct ublk_bpf_ctx, op):
+		return size == sizeof_field(struct ublk_bpf_ctx, op);
+	case offsetof(struct ublk_bpf_ctx, nr_sectors):
+		return size == sizeof_field(struct ublk_bpf_ctx, nr_sectors);
+	case offsetof(struct ublk_bpf_ctx, start_sector):
+		return size == sizeof_field(struct ublk_bpf_ctx, start_sector);
+	}
 	return false;
 }
 
@@ -324,7 +381,7 @@ static void ublk_put_device(struct ublk_device *ub)
 static inline struct ublk_queue *ublk_get_queue(struct ublk_device *dev,
 		int qid)
 {
-       return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
+	return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
 }
 
 static inline bool ublk_rq_has_data(const struct request *rq)
@@ -492,12 +549,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
 static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 		struct ublk_io *io)
 {
+	struct ublk_device *ub = ubq->dev;
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 	/*
 	 * no zero copy, we delay copy WRITE request data into ublksrv
 	 * context and the big benefit is that pinning pages in current
 	 * context is pretty fast, see ublk_pin_user_pages
 	 */
+	if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
+		return rq_bytes;
+
 	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
 		return rq_bytes;
 
@@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
 	}
 }
 
+static void ublk_bpf_io_submit_fn(struct callback_head *work)
+{
+	struct ublk_io_bpf_ctx *bpf_ctx = container_of(work,
+			struct ublk_io_bpf_ctx, work);
+
+	if (bpf_ctx->ub->io_submit_prog)
+		bpf_prog_run_pin_on_cpu(bpf_ctx->ub->io_submit_prog, bpf_ctx);
+	kfree(bpf_ctx);
+}
+
+static int ublk_init_uring_kbuf(struct request *rq)
+{
+	struct bio_vec *bvec;
+	struct req_iterator rq_iter;
+	struct bio_vec tmp;
+	int nr_bvec = 0;
+	struct io_mapped_kbuf *kbuf;
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+
+	/* Drop previous allocation */
+	if (data->kbuf) {
+		kfree(data->kbuf->bvec);
+		kfree(data->kbuf);
+		data->kbuf = NULL;
+	}
+
+	kbuf = kmalloc(sizeof(struct io_mapped_kbuf), GFP_NOIO);
+	if (!kbuf)
+		return -EIO;
+
+	rq_for_each_bvec(tmp, rq, rq_iter)
+		nr_bvec++;
+
+	bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), GFP_NOIO);
+	if (!bvec) {
+		kfree(kbuf);
+		return -EIO;
+	}
+	kbuf->bvec = bvec;
+	rq_for_each_bvec(tmp, rq, rq_iter) {
+		*bvec = tmp;
+		bvec++;
+	}
+
+	kbuf->count = blk_rq_bytes(rq);
+	kbuf->nr_bvecs = nr_bvec;
+	data->kbuf = kbuf;
+	return 0;
+}
+
+static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
+{
+	int err;
+	struct ublk_device *ub = ubq->dev;
+	struct bpf_prog *prog = ub->io_prep_prog;
+	struct ublk_io_bpf_ctx *bpf_ctx;
+
+	if (!prog)
+		return 0;
+
+	bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
+	if (!bpf_ctx)
+		return -EIO;
+
+	err = ublk_init_uring_kbuf(rq);
+	if (err < 0) {
+		kfree(bpf_ctx);
+		return -EIO;
+	}
+	bpf_ctx->ub = ub;
+	bpf_ctx->ctx.q_id = ubq->q_id;
+	bpf_ctx->ctx.tag = rq->tag;
+	bpf_ctx->ctx.op = req_op(rq);
+	bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
+	bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
+	bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
+
+	init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
+	if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
+		kfree(bpf_ctx);
+	return 0;
+}
+
 static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
@@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(res != BLK_STS_OK))
 		return BLK_STS_IOERR;
 
+	/* Currently just for test. */
+	ublk_run_bpf_prog(ubq, rq);
+
 	/* With recovery feature enabled, force_abort is set in
 	 * ublk_stop_dev() before calling del_gendisk(). We have to
 	 * abort all requeued and new rqs here to let del_gendisk()
@@ -2009,6 +2156,56 @@ static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
 	return ret;
 }
 
+static int ublk_ctrl_reg_bpf_prog(struct io_uring_cmd *cmd)
+{
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublk_device *ub;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ub = ublk_get_device_from_id(header->dev_id);
+	if (!ub)
+		return -EINVAL;
+
+	mutex_lock(&ub->mutex);
+	prog = bpf_prog_get_type(header->data[0], BPF_PROG_TYPE_UBLK);
+	if (IS_ERR(prog)) {
+		ret = PTR_ERR(prog);
+		goto out_unlock;
+	}
+	ub->io_prep_prog = prog;
+
+	prog = bpf_prog_get_type(header->data[1], BPF_PROG_TYPE_UBLK);
+	if (IS_ERR(prog)) {
+		ret = PTR_ERR(prog);
+		goto out_unlock;
+	}
+	ub->io_submit_prog = prog;
+
+out_unlock:
+	mutex_unlock(&ub->mutex);
+	ublk_put_device(ub);
+	return ret;
+}
+
+static int ublk_ctrl_unreg_bpf_prog(struct io_uring_cmd *cmd)
+{
+	struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+	struct ublk_device *ub;
+
+	ub = ublk_get_device_from_id(header->dev_id);
+	if (!ub)
+		return -EINVAL;
+
+	mutex_lock(&ub->mutex);
+	bpf_prog_put(ub->io_prep_prog);
+	bpf_prog_put(ub->io_submit_prog);
+	ub->io_prep_prog = NULL;
+	ub->io_submit_prog = NULL;
+	mutex_unlock(&ub->mutex);
+	ublk_put_device(ub);
+	return 0;
+}
 static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
@@ -2059,6 +2256,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_CMD_END_USER_RECOVERY:
 		ret = ublk_ctrl_end_recovery(cmd);
 		break;
+	case UBLK_CMD_REG_BPF_PROG:
+		ret = ublk_ctrl_reg_bpf_prog(cmd);
+		break;
+	case UBLK_CMD_UNREG_BPF_PROG:
+		ret = ublk_ctrl_unreg_bpf_prog(cmd);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 515b7b995b3a..578d65e9f30e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5699,6 +5699,7 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(ublk_queue_sqe, 212, ##ctx)			\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 8f88e3a29998..a43b1864de51 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -17,6 +17,8 @@
 #define	UBLK_CMD_STOP_DEV	0x07
 #define	UBLK_CMD_SET_PARAMS	0x08
 #define	UBLK_CMD_GET_PARAMS	0x09
+#define UBLK_CMD_REG_BPF_PROG		0x0a
+#define UBLK_CMD_UNREG_BPF_PROG		0x0b
 #define	UBLK_CMD_START_USER_RECOVERY	0x10
 #define	UBLK_CMD_END_USER_RECOVERY	0x11
 /*
@@ -230,4 +232,13 @@ struct ublk_params {
 	struct ublk_param_discard	discard;
 };
 
+struct ublk_bpf_ctx {
+	__u32	t_val;
+	__u16	q_id;
+	__u16	tag;
+	__u8	op;
+	__u32	nr_sectors;
+	__u64	start_sector;
+};
+
 #endif
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index e8d90829f23e..f8672294e145 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -700,6 +700,8 @@ class PrinterHelpers(Printer):
             'struct bpf_dynptr',
             'struct iphdr',
             'struct ipv6hdr',
+            'struct ublk_io_bpf_ctx',
+            'struct io_uring_sqe',
     ]
     known_types = {
             '...',
@@ -755,6 +757,8 @@ class PrinterHelpers(Printer):
             'const struct bpf_dynptr',
             'struct iphdr',
             'struct ipv6hdr',
+            'struct ublk_io_bpf_ctx',
+            'struct io_uring_sqe',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 515b7b995b3a..530094246e2a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5485,6 +5485,13 @@ union bpf_attr {
  *		0 on success.
  *
  *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *ctx, struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
+ *	Description
+ *		Submit ublk io requests.
+ *	Return
+ *		0 on success.
+ *
  */
 #define ___BPF_FUNC_MAPPER(FN, ctx...)			\
 	FN(unspec, 0, ##ctx)				\
@@ -5699,6 +5706,7 @@ union bpf_attr {
 	FN(user_ringbuf_drain, 209, ##ctx)		\
 	FN(cgrp_storage_get, 210, ##ctx)		\
 	FN(cgrp_storage_delete, 211, ##ctx)		\
+	FN(ublk_queue_sqe, 212, ##ctx)			\
 	/* */
 
 /* backwards-compatibility macros for users of __BPF_FUNC_MAPPER that don't
-- 
2.31.1



* [UBLKSRV] Add ebpf support.
  2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
                   ` (2 preceding siblings ...)
  2023-02-15  0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
@ 2023-02-15  0:46 ` Xiaoguang Wang
  2023-02-16  8:28   ` Ming Lei
  2023-02-15  8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang
  4 siblings, 1 reply; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-15  0:46 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, ZiyangZhang

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 bpf/ublk.bpf.c         | 168 +++++++++++++++++++++++++++++++++++++++++
 include/ublk_cmd.h     |   2 +
 include/ublksrv.h      |   8 ++
 include/ublksrv_priv.h |   1 +
 include/ublksrv_tgt.h  |   1 +
 lib/ublksrv.c          |   4 +
 lib/ublksrv_cmd.c      |  21 ++++++
 tgt_loop.cpp           |  31 +++++++-
 ublksrv_tgt.cpp        |  33 ++++++++
 9 files changed, 268 insertions(+), 1 deletion(-)
 create mode 100644 bpf/ublk.bpf.c

diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
new file mode 100644
index 0000000..80e79de
--- /dev/null
+++ b/bpf/ublk.bpf.c
@@ -0,0 +1,168 @@
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+
+
+static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
+		u32 sqe_len, u32 fd) = (void *) 212;
+
+int target_fd = -1;
+
+struct sqe_key {
+	u16 q_id;
+	u16 tag;
+	u32 res;
+	u64 offset;
+};
+
+struct sqe_data {
+	char data[128];
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 8192);
+	__type(key, struct sqe_key);
+	__type(value, struct sqe_data);
+} sqes_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 128);
+	__type(key, int);
+	__type(value, int);
+} uring_fd_map SEC(".maps");
+
+static inline void io_uring_prep_rw(__u8 op, struct io_uring_sqe *sqe, int fd,
+				    const void *addr, unsigned len,
+				    __u64 offset)
+{
+	sqe->opcode = op;
+	sqe->flags = 0;
+	sqe->ioprio = 0;
+	sqe->fd = fd;
+	sqe->off = offset;
+	sqe->addr = (unsigned long) addr;
+	sqe->len = len;
+	sqe->fsync_flags = 0;
+	sqe->buf_index = 0;
+	sqe->personality = 0;
+	sqe->splice_fd_in = 0;
+	sqe->addr3 = 0;
+	sqe->__pad2[0] = 0;
+}
+
+static inline void io_uring_prep_nop(struct io_uring_sqe *sqe)
+{
+	io_uring_prep_rw(IORING_OP_NOP, sqe, -1, 0, 0, 0);
+}
+
+static inline void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
+			void *buf, unsigned nbytes, off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_READ, sqe, fd, buf, nbytes, offset);
+}
+
+static inline void io_uring_prep_write(struct io_uring_sqe *sqe, int fd,
+	const void *buf, unsigned nbytes, off_t offset)
+{
+	io_uring_prep_rw(IORING_OP_WRITE, sqe, fd, buf, nbytes, offset);
+}
+
+/*
+static u64 submit_sqe(struct bpf_map *map, void *key, void *value, void *data)
+{
+	struct io_uring_sqe *sqe = (struct io_uring_sqe *)value;
+	struct ublk_bpf_ctx *ctx = ((struct callback_ctx *)data)->ctx;
+	struct sqe_key *skey = (struct sqe_key *)key;
+	char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
+	char fmt2[] ="submit sqe test prep\n";
+	u16 qid, tag;
+	int q_id = skey->q_id, *ring_fd;
+
+	bpf_trace_printk(fmt2, sizeof(fmt2));
+	ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
+	if (ring_fd) {
+		bpf_trace_printk(fmt, sizeof(fmt), qid, skey->tag);
+		bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
+		bpf_map_delete_elem(map, key);
+	}
+	return 0;
+}
+*/
+
+static inline __u64 build_user_data(unsigned tag, unsigned op,
+			unsigned tgt_data, unsigned is_target_io,
+			unsigned is_bpf_io)
+{
+	return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63 |
+		(__u64)is_bpf_io << 60;
+}
+
+SEC("ublk.s/")
+int ublk_io_prep_prog(struct ublk_bpf_ctx *ctx)
+{
+	struct io_uring_sqe *sqe;
+	struct sqe_data sd = {0};
+	struct sqe_key key;
+	u16 q_id = ctx->q_id;
+	u8 op; // = ctx->op;
+	u32 nr_sectors = ctx->nr_sectors;
+	u64 start_sector = ctx->start_sector;
+	char fmt_1[] ="ublk_io_prep_prog %d %d\n";
+
+	key.q_id = ctx->q_id;
+	key.tag = ctx->tag;
+	key.offset = 0;
+	key.res = 0;
+
+	bpf_probe_read_kernel(&op, 1, &ctx->op);
+	bpf_trace_printk(fmt_1, sizeof(fmt_1), q_id, op);
+	sqe = (struct io_uring_sqe *)&sd;
+	if (op == REQ_OP_READ) {
+		char fmt[] ="add read sqe\n";
+
+		bpf_trace_printk(fmt, sizeof(fmt));
+		io_uring_prep_read(sqe, target_fd, 0, nr_sectors << 9,
+				   start_sector << 9);
+		sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
+		bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
+	} else if (op == REQ_OP_WRITE) {
+		char fmt[] ="add write sqe\n";
+
+		bpf_trace_printk(fmt, sizeof(fmt));
+
+		io_uring_prep_write(sqe, target_fd, 0, nr_sectors << 9,
+				    start_sector << 9);
+		sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
+		bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
+	} else {
+		;
+	}
+	return 0;
+}
+
+SEC("ublk.s/")
+int ublk_io_submit_prog(struct ublk_bpf_ctx *ctx)
+{
+	struct io_uring_sqe *sqe;
+	char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
+	int q_id = ctx->q_id, *ring_fd;
+	struct sqe_key key;
+
+	key.q_id = ctx->q_id;
+	key.tag = ctx->tag;
+	key.offset = 0;
+	key.res = 0;
+
+	sqe = bpf_map_lookup_elem(&sqes_map, &key);
+	ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
+	if (ring_fd) {
+		bpf_trace_printk(fmt, sizeof(fmt), key.q_id, key.tag);
+		bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
+		bpf_map_delete_elem(&sqes_map, &key);
+	}
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
index f6238cc..893ba8c 100644
--- a/include/ublk_cmd.h
+++ b/include/ublk_cmd.h
@@ -17,6 +17,8 @@
 #define	UBLK_CMD_STOP_DEV	0x07
 #define	UBLK_CMD_SET_PARAMS	0x08
 #define	UBLK_CMD_GET_PARAMS	0x09
+#define UBLK_CMD_REG_BPF_PROG		0x0a
+#define UBLK_CMD_UNREG_BPF_PROG		0x0b
 #define	UBLK_CMD_START_USER_RECOVERY	0x10
 #define	UBLK_CMD_END_USER_RECOVERY	0x11
 #define	UBLK_CMD_GET_DEV_INFO2		0x12
diff --git a/include/ublksrv.h b/include/ublksrv.h
index d38bd46..f5deddb 100644
--- a/include/ublksrv.h
+++ b/include/ublksrv.h
@@ -106,6 +106,7 @@ struct ublksrv_tgt_info {
 	unsigned int nr_fds;
 	int fds[UBLKSRV_TGT_MAX_FDS];
 	void *tgt_data;
+	void *tgt_bpf_obj;
 
 	/*
 	 * Extra IO slots for each queue, target code can reserve some
@@ -263,6 +264,8 @@ struct ublksrv_tgt_type {
 	int (*init_queue)(const struct ublksrv_queue *, void **queue_data_ptr);
 	void (*deinit_queue)(const struct ublksrv_queue *);
 
+	int (*init_queue_bpf)(const struct ublksrv_dev *dev, const struct ublksrv_queue *q);
+
 	unsigned long reserved[5];
 };
 
@@ -318,6 +321,11 @@ extern void ublksrv_ctrl_prep_recovery(struct ublksrv_ctrl_dev *dev,
 		const char *recovery_jbuf);
 extern const char *ublksrv_ctrl_get_recovery_jbuf(const struct ublksrv_ctrl_dev *dev);
 
+extern void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,
+					  void *obj);
+extern int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
+				     int io_prep_fd, int io_submit_fd);
+
 /* ublksrv device ("/dev/ublkcN") level APIs */
 extern const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *
 		ctrl_dev);
diff --git a/include/ublksrv_priv.h b/include/ublksrv_priv.h
index 2996baa..8da8866 100644
--- a/include/ublksrv_priv.h
+++ b/include/ublksrv_priv.h
@@ -42,6 +42,7 @@ struct ublksrv_ctrl_dev {
 
 	const char *tgt_type;
 	const struct ublksrv_tgt_type *tgt_ops;
+	void *bpf_obj;
 
 	/*
 	 * default is UBLKSRV_RUN_DIR but can be specified via command line,
diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
index 234d31e..e0db7d9 100644
--- a/include/ublksrv_tgt.h
+++ b/include/ublksrv_tgt.h
@@ -9,6 +9,7 @@
 #include <getopt.h>
 #include <string.h>
 #include <stdarg.h>
+#include <limits.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/ioctl.h>
diff --git a/lib/ublksrv.c b/lib/ublksrv.c
index 96bed95..110ccb3 100644
--- a/lib/ublksrv.c
+++ b/lib/ublksrv.c
@@ -603,6 +603,9 @@ skip_alloc_buf:
 		goto fail;
 	}
 
+	if (dev->tgt.ops->init_queue_bpf)
+		dev->tgt.ops->init_queue_bpf(tdev, local_to_tq(q));
+
 	ublksrv_dev_init_io_cmds(dev, q);
 
 	/*
@@ -723,6 +726,7 @@ const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *ctrl_d
 	}
 
 	tgt->fds[0] = dev->cdev_fd;
+	tgt->tgt_bpf_obj = ctrl_dev->bpf_obj;
 
 	ret = ublksrv_tgt_init(dev, ctrl_dev->tgt_type, ctrl_dev->tgt_ops,
 			ctrl_dev->tgt_argc, ctrl_dev->tgt_argv);
diff --git a/lib/ublksrv_cmd.c b/lib/ublksrv_cmd.c
index 0d7265d..0101cb9 100644
--- a/lib/ublksrv_cmd.c
+++ b/lib/ublksrv_cmd.c
@@ -502,6 +502,27 @@ int ublksrv_ctrl_end_recovery(struct ublksrv_ctrl_dev *dev, int daemon_pid)
 	return ret;
 }
 
+int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
+			      int io_prep_fd, int io_submit_fd)
+{
+	struct ublksrv_ctrl_cmd_data data = {
+		.cmd_op = UBLK_CMD_REG_BPF_PROG,
+		.flags = CTRL_CMD_HAS_DATA,
+	};
+	int ret;
+
+	data.data[0] = io_prep_fd;
+	data.data[1] = io_submit_fd;
+
+	ret = __ublksrv_ctrl_cmd(dev, &data);
+	return ret;
+}
+
+void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,  void *obj)
+{
+	dev->bpf_obj = obj;
+}
+
 const struct ublksrv_ctrl_dev_info *ublksrv_ctrl_get_dev_info(
 		const struct ublksrv_ctrl_dev *dev)
 {
diff --git a/tgt_loop.cpp b/tgt_loop.cpp
index 79a65d3..b1568fe 100644
--- a/tgt_loop.cpp
+++ b/tgt_loop.cpp
@@ -4,7 +4,11 @@
 
 #include <poll.h>
 #include <sys/epoll.h>
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
 #include "ublksrv_tgt.h"
+#include "bpf/.tmp/ublk.skel.h"
 
 static bool backing_supports_discard(char *name)
 {
@@ -88,6 +92,20 @@ static int loop_recovery_tgt(struct ublksrv_dev *dev, int type)
 	return 0;
 }
 
+static int loop_init_queue_bpf(const struct ublksrv_dev *dev,
+			       const struct ublksrv_queue *q)
+{
+	int ret, q_id, ring_fd;
+	const struct ublksrv_tgt_info *tgt = &dev->tgt;
+	struct ublk_bpf *obj = (struct ublk_bpf*)tgt->tgt_bpf_obj;
+
+	q_id = q->q_id;
+	ring_fd = q->ring_ptr->ring_fd;
+	ret = bpf_map_update_elem(bpf_map__fd(obj->maps.uring_fd_map), &q_id,
+				  &ring_fd,  0);
+	return ret;
+}
+
 static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
 		*argv[])
 {
@@ -125,6 +143,7 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
 		},
 	};
 	bool can_discard = false;
+	struct ublk_bpf *bpf_obj;
 
 	strcpy(tgt_json.name, "loop");
 
@@ -218,6 +237,10 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
 			jbuf = ublksrv_tgt_realloc_json_buf(dev, &jbuf_size);
 	} while (ret < 0);
 
+	if (tgt->tgt_bpf_obj) {
+		bpf_obj = (struct ublk_bpf *)tgt->tgt_bpf_obj;
+		bpf_obj->data->target_fd = tgt->fds[1];
+	}
 	return 0;
 }
 
@@ -252,9 +275,14 @@ static int loop_queue_tgt_io(const struct ublksrv_queue *q,
 		const struct ublk_io_data *data, int tag)
 {
 	const struct ublksrv_io_desc *iod = data->iod;
-	struct io_uring_sqe *sqe = io_uring_get_sqe(q->ring_ptr);
+	struct io_uring_sqe *sqe;
 	unsigned ublk_op = ublksrv_get_op(iod);
 
+	/* ebpf prog will handle read/write requests. */
+	if ((ublk_op == UBLK_IO_OP_READ) || (ublk_op == UBLK_IO_OP_WRITE))
+		return 1;
+
+	sqe = io_uring_get_sqe(q->ring_ptr);
 	if (!sqe)
 		return 0;
 
@@ -374,6 +402,7 @@ struct ublksrv_tgt_type  loop_tgt_type = {
 	.type	= UBLKSRV_TGT_TYPE_LOOP,
 	.name	=  "loop",
 	.recovery_tgt = loop_recovery_tgt,
+	.init_queue_bpf = loop_init_queue_bpf,
 };
 
 static void tgt_loop_init() __attribute__((constructor));
diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
index 5ed328d..d3796cf 100644
--- a/ublksrv_tgt.cpp
+++ b/ublksrv_tgt.cpp
@@ -2,6 +2,7 @@
 
 #include "config.h"
 #include "ublksrv_tgt.h"
+#include "bpf/.tmp/ublk.skel.h"
 
 /* per-task variable */
 static pthread_mutex_t jbuf_lock;
@@ -575,6 +576,31 @@ static void ublksrv_tgt_set_params(struct ublksrv_ctrl_dev *cdev,
 	}
 }
 
+static int ublksrv_tgt_load_bpf_prog(struct ublksrv_ctrl_dev *cdev)
+{
+	struct ublk_bpf *obj;
+	int ret, io_prep_fd, io_submit_fd;
+
+	obj = ublk_bpf__open();
+	if (!obj) {
+		fprintf(stderr, "failed to open BPF object\n");
+		return -1;
+	}
+	ret = ublk_bpf__load(obj);
+	if (ret) {
+		fprintf(stderr, "failed to load BPF object\n");
+		return -1;
+	}
+
+
+	io_prep_fd = bpf_program__fd(obj->progs.ublk_io_prep_prog);
+	io_submit_fd = bpf_program__fd(obj->progs.ublk_io_submit_prog);
+	ret = ublksrv_ctrl_reg_bpf_prog(cdev, io_prep_fd, io_submit_fd);
+	if (!ret)
+		ublksrv_ctrl_set_bpf_obj_info(cdev, obj);
+	return ret;
+}
+
 static int cmd_dev_add(int argc, char *argv[])
 {
 	static const struct option longopts[] = {
@@ -696,6 +722,13 @@ static int cmd_dev_add(int argc, char *argv[])
 		goto fail;
 	}
 
+	ret = ublksrv_tgt_load_bpf_prog(dev);
+	if (ret < 0) {
+		fprintf(stderr, "dev %d load bpf prog failed, ret %d\n",
+			data.dev_id, ret);
+		goto fail_stop_daemon;
+	}
+
 	{
 		const struct ublksrv_ctrl_dev_info *info =
 			ublksrv_ctrl_get_dev_info(dev);
-- 
2.31.1



* Re: [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk
  2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
                   ` (3 preceding siblings ...)
  2023-02-15  0:46 ` [UBLKSRV] Add " Xiaoguang Wang
@ 2023-02-15  8:40 ` Ziyang Zhang
  4 siblings, 0 replies; 18+ messages in thread
From: Ziyang Zhang @ 2023-02-15  8:40 UTC (permalink / raw)
  To: linux-block, io-uring, bpf; +Cc: ming.lei, axboe, asml.silence, Xiaoguang Wang

On 2023/2/15 08:41, Xiaoguang Wang wrote:
> Normally, userspace block device implementations need to copy data between
> the kernel block layer's io requests and the userspace block device's
> userspace daemon; for example, ublk and tcmu both have similar logic. This
> copy noticeably consumes cpu resources, especially for large io.
> 
> There are methods that try to reduce these cpu overheads, so that the
> userspace block device's io performance can be improved further. They
> include: 1) using special hardware to do the memory copy, but it seems not
> all architectures have such hardware; 2) software methods, such as
> mmap()ing the kernel block layer's io request data into the userspace
> daemon [1], but that has page table map/unmap and tlb flush overheads,
> security issues, etc, and may only be friendly to large io.
> 
> To solve this problem, I'd propose a new method which combines the
> respective advantages of io_uring and ebpf. Add a new program type
> BPF_PROG_TYPE_UBLK for ublk; the userspace block device daemon process
> registers an ebpf prog, and this bpf prog uses the bpf helper offered by
> the ublk bpf prog type to submit io requests on behalf of the daemon
> process. Currently there is only one helper:
>     u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
> 		struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
> 
> This helper uses io_uring to submit io requests, so we need to make
> io_uring able to submit an sqe located in the kernel (some of the code
> idea comes from Pavel's patchset [2], but Pavel's patches still require
> sqe->buf to come from a userspace address). The bpf prog initializes sqes,
> but does not need to initialize the sqes' buf field; sqe->buf will come
> from kernel block layer io requests in some form. See patch 2 for more.
> 
> In the example of the ublk loop target, we can easily implement the below
> logic in an ebpf prog:
>   1. The userspace daemon registers an ebpf prog and passes two backend
> file fds in an ebpf map structure.
>   2. For kernel io requests against the first half of the userspace device,
> the ebpf prog prepares an io_uring sqe, which will submit io against the
> first backend file fd, and the sqe's buffer comes from the kernel io
> request. Kernel io requests against the second half of the userspace device
> have similar logic; only the sqe's fd will be the second backend file fd.
>   3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is
> executed and completes the kernel io requests.
> 
> That means, by using ebpf, we can implement various kinds of userspace
> logic in the kernel.
> 
> From the above example, we can see that this method has at least 3
> advantages:
>   1. Memory copies between the kernel block layer and the userspace daemon
> are removed completely.
>   2. Memory is saved. The userspace daemon doesn't need to maintain its own
> buffers to issue and complete io requests; the kernel block layer io
> requests' memory is used directly.
>   3. We may reduce the number of round trips between the kernel and the
> userspace daemon, and thus reduce kernel & userspace context switch
> overheads.
> 
> Test:
> Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
> 
> fio job file:
>   [global]
>   direct=1
>   filename=/dev/ublkb0
>   time_based
>   runtime=60
>   numjobs=1
>   cpus_allowed=1
>   
>   [rand-write-512k]
>   bs=512K
>   iodepth=16
>   ioengine=libaio
>   rw=randwrite
>   stonewall
> 
> 
> Without this patch:
>   WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
>   The ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by top.
> 
> With this patch:
>   WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
>   The ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by top.
> 
> From the above tests, this method reduces cpu copy overhead noticeably.
> 
> 
> TODO:
> I must say this patchset is just a RFC for design.
> 
> 1) Currently in this patchset, I just make the ublk ebpf prog submit io
> requests using io_uring in the kernel; cqe events still need to be handled
> in the userspace daemon. Once we later succeed in making io_uring handle
> cqes in the kernel, the ublk ebpf prog can implement io entirely in the
> kernel.
> 
> 2) The ublk driver needs to work better with ebpf. Currently I did some
> hack code to support ebpf in the ublk driver, and it can only support
> write requests.
> 
> 3) I have not done many tests yet, and will run liburing/ublk/blktests
> later.
> 
> Any review and suggestions are welcome, thanks.
> 
> [1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@linux.alibaba.com/
> [2] https://lore.kernel.org/all/cover.1621424513.git.asml.silence@gmail.com/
> 
> 
> Xiaoguang Wang (3):
>   bpf: add UBLK program type
>   io_uring: enable io_uring to submit sqes located in kernel
>   ublk_drv: add ebpf support
> 
>  drivers/block/ublk_drv.c       | 228 ++++++++++++++++++++++++++++++++-
>  include/linux/bpf_types.h      |   2 +
>  include/linux/io_uring.h       |  13 ++
>  include/linux/io_uring_types.h |   8 +-
>  include/uapi/linux/bpf.h       |   2 +
>  include/uapi/linux/ublk_cmd.h  |  11 ++
>  io_uring/io_uring.c            |  59 ++++++++-
>  io_uring/rsrc.c                |  15 +++
>  io_uring/rsrc.h                |   3 +
>  io_uring/rw.c                  |   7 +
>  kernel/bpf/syscall.c           |   1 +
>  kernel/bpf/verifier.c          |   9 +-
>  scripts/bpf_doc.py             |   4 +
>  tools/include/uapi/linux/bpf.h |   9 ++
>  tools/lib/bpf/libbpf.c         |   2 +
>  15 files changed, 366 insertions(+), 7 deletions(-)
> 

Hi, here is the perf report output of the ublk daemon (loop target):


+   57.96%     4.03%  ublk           liburing.so.2.2                                [.] _io_uring_get_cqe
+   53.94%     0.00%  ublk           [kernel.vmlinux]                               [k] entry_SYSCALL_64
+   53.94%     0.65%  ublk           [kernel.vmlinux]                               [k] do_syscall_64
+   48.37%     1.18%  ublk           [kernel.vmlinux]                               [k] __do_sys_io_uring_enter
+   42.92%     1.72%  ublk           [kernel.vmlinux]                               [k] io_cqring_wait
+   35.17%     0.06%  ublk           [kernel.vmlinux]                               [k] task_work_run
+   34.75%     0.53%  ublk           [kernel.vmlinux]                               [k] io_run_task_work_sig
+   33.45%     0.00%  ublk           [kernel.vmlinux]                               [k] ublk_bpf_io_submit_fn
+   33.16%     0.06%  ublk           bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog  [k] bpf_prog_3bdc6181a3c616fb_ublk_io_sub
+   32.68%     0.00%  iou-wrk-18583  [unknown]                                      [k] 0000000000000000
+   32.68%     0.00%  iou-wrk-18583  [unknown]                                      [k] 0x00007efe920b1040
+   32.68%     0.00%  iou-wrk-18583  [kernel.vmlinux]                               [k] ret_from_fork
+   32.68%     0.47%  iou-wrk-18583  [kernel.vmlinux]                               [k] io_wqe_worker
+   30.61%     0.00%  ublk           [kernel.vmlinux]                               [k] io_submit_sqe
+   30.31%     0.06%  ublk           [kernel.vmlinux]                               [k] io_issue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]                               [k] bpf_ublk_queue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]                               [k] io_uring_submit_sqe
+   27.18%     0.00%  ublk           [kernel.vmlinux]                               [k] io_write
+   27.18%     0.00%  ublk           [xfs]                                          [k] xfs_file_write_iter

The call stack is:

-   57.96%     4.03%  ublk           liburing.so.2.2                                [.] _io_uring_get_cqe
   - 53.94% _io_uring_get_cqe
        entry_SYSCALL_64
      - do_syscall_64
         - 48.37% __do_sys_io_uring_enter
            - 42.92% io_cqring_wait
               - 34.75% io_run_task_work_sig
                  - task_work_run
                     - 32.50% ublk_bpf_io_submit_fn
                        - 32.21% bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog
                           - 27.12% bpf_ublk_queue_sqe
                              - io_uring_submit_sqe
                                 - 26.64% io_submit_sqe
                                    - 26.35% io_issue_sqe
                                       - io_write
                                         xfs_file_write_iter

Here, "io_submit" ebpf prog will be run in task_work of ublk daemon
process after io_uring_enter() syscall. In this ebpf prog, a sqe is
built and submitted. All information about this blk-mq request is
stored in a "ctx". Then io_uring can write to the backing file
(xfs_file_write_iter).
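
In condensed form (taken from the ublksrv ublk.bpf.c patch reviewed
later in this thread, with tracing and map cleanup dropped), that
submit prog looks like:

    SEC("ublk.s/")
    int ublk_io_submit_prog(struct ublk_bpf_ctx *ctx)
    {
            struct sqe_key key = { .q_id = ctx->q_id, .tag = ctx->tag };
            int q_id = ctx->q_id, *ring_fd;
            struct io_uring_sqe *sqe;

            sqe = bpf_map_lookup_elem(&sqes_map, &key);     /* stashed by io_prep */
            ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
            if (sqe && ring_fd)
                    /* the kernel submits this sqe on the daemon's io_uring */
                    bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
            return 0;
    }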

Here is the call stack from the perf report output of fio:

-    5.04%     0.18%  fio      [kernel.vmlinux]                             [k] ublk_queue_rq
   - 4.86% ublk_queue_rq
      - 3.67% bpf_prog_b8456549dbe40c37_ublk_io_prep_prog
         - 3.10% bpf_trace_printk
              2.83% _raw_spin_unlock_irqrestore
      - 0.70% task_work_add
         - try_to_wake_up
              _raw_spin_unlock_irqrestore

Here, "io_prep" ebpf prog will be run in "ublk_queue_rq" process.
In this ebpf prog, qid, tag, nr_sectors, start_sector, op, flags
will be stored in one "ctx". Then we add a task_work to the ublk
daemon process.
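
The driver side that drives both progs is condensed below from the
ublk_drv patch in this thread (kbuf setup and allocation-failure
handling dropped):

    static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
    {
            struct ublk_device *ub = ubq->dev;
            struct ublk_io_bpf_ctx *bpf_ctx = kmalloc(sizeof(*bpf_ctx), GFP_NOIO);

            bpf_ctx->ub = ub;
            bpf_ctx->ctx.q_id = ubq->q_id;
            bpf_ctx->ctx.tag = rq->tag;
            bpf_ctx->ctx.op = req_op(rq);
            bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
            bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
            /* "io_prep" prog runs right here, in ublk_queue_rq() context */
            bpf_prog_run_pin_on_cpu(ub->io_prep_prog, bpf_ctx);

            /* defer the "io_submit" prog to the daemon task, which owns the ring */
            init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
            return task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI);
    }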

Regards,
Zhang

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
  2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
@ 2023-02-15 14:17   ` kernel test robot
  2023-02-15 14:50   ` kernel test robot
  2023-02-15 15:41   ` kernel test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-02-15 14:17 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: oe-kbuild-all

Hi Xiaoguang,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on bpf/master]
[also build test ERROR on axboe-block/for-next linus/master v6.2-rc8]
[cannot apply to bpf-next/master next-20230215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230215004122.28917-3-xiaoguang.wang%40linux.alibaba.com
patch subject: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
config: i386-tinyconfig (https://download.01.org/0day-ci/archive/20230215/202302152202.3aHc65jB-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/0d433a1cf7f666e5596514da7fab92f2691463ad
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
        git checkout 0d433a1cf7f666e5596514da7fab92f2691463ad
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302152202.3aHc65jB-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from kernel/fork.c:97:
>> include/linux/io_uring.h:107:5: warning: no previous prototype for 'io_uring_submit_sqe' [-Wmissing-prototypes]
     107 | int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
         |     ^~~~~~~~~~~~~~~~~~~
--
   In file included from kernel/exit.c:67:
>> include/linux/io_uring.h:107:5: warning: no previous prototype for 'io_uring_submit_sqe' [-Wmissing-prototypes]
     107 | int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
         |     ^~~~~~~~~~~~~~~~~~~
   kernel/exit.c:1901:13: warning: no previous prototype for 'abort' [-Wmissing-prototypes]
    1901 | __weak void abort(void)
         |             ^~~~~
--
   ld: kernel/exit.o: in function `io_uring_submit_sqe':
>> exit.c:(.text+0x4b6): multiple definition of `io_uring_submit_sqe'; kernel/fork.o:fork.c:(.text+0x5bb): first defined here
   ld: fs/exec.o: in function `io_uring_submit_sqe':
   exec.c:(.text+0x9c5): multiple definition of `io_uring_submit_sqe'; kernel/fork.o:fork.c:(.text+0x5bb): first defined here
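
Both the -Wmissing-prototypes warnings and the multiple-definition link
errors point at the !CONFIG_IO_URING fallback of io_uring_submit_sqe():
it is defined in the header without static inline, so every translation
unit including io_uring.h emits its own copy. A likely fix, matching
the surrounding stubs:

    static inline int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe,
                                          u32 sqe_len, struct io_mapped_kbuf *kbuf)
    {
            return 0;
    }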

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
  2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
  2023-02-15 14:17   ` kernel test robot
@ 2023-02-15 14:50   ` kernel test robot
  2023-02-15 15:41   ` kernel test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-02-15 14:50 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: oe-kbuild-all

Hi Xiaoguang,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on bpf/master]
[also build test WARNING on axboe-block/for-next linus/master v6.2-rc8]
[cannot apply to bpf-next/master next-20230215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230215004122.28917-3-xiaoguang.wang%40linux.alibaba.com
patch subject: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
config: arm-zeus_defconfig (https://download.01.org/0day-ci/archive/20230215/202302152219.3euu62fj-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/0d433a1cf7f666e5596514da7fab92f2691463ad
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
        git checkout 0d433a1cf7f666e5596514da7fab92f2691463ad
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302152219.3euu62fj-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from include/linux/bits.h:6,
                    from include/linux/ratelimit_types.h:5,
                    from include/linux/printk.h:9,
                    from include/asm-generic/bug.h:22,
                    from arch/arm/include/asm/bug.h:60,
                    from include/linux/bug.h:5,
                    from include/linux/thread_info.h:13,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/arm/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:56,
                    from include/linux/wait.h:9,
                    from include/linux/wait_bit.h:8,
                    from include/linux/fs.h:6,
                    from include/linux/highmem.h:5,
                    from kernel/sched/core.c:9:
>> include/vdso/bits.h:7:40: warning: left shift count >= width of type [-Wshift-count-overflow]
       7 | #define BIT(nr)                 (UL(1) << (nr))
         |                                        ^~
   include/linux/io_uring_types.h:472:35: note: in expansion of macro 'BIT'
     472 |         REQ_F_KBUF              = BIT(REQ_F_KBUF_BIT),
         |                                   ^~~


vim +7 include/vdso/bits.h

3945ff37d2f48d Vincenzo Frascino 2020-03-20  6  
3945ff37d2f48d Vincenzo Frascino 2020-03-20 @7  #define BIT(nr)			(UL(1) << (nr))
3945ff37d2f48d Vincenzo Frascino 2020-03-20  8  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
  2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
  2023-02-15 14:17   ` kernel test robot
  2023-02-15 14:50   ` kernel test robot
@ 2023-02-15 15:41   ` kernel test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-02-15 15:41 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: llvm, oe-kbuild-all

Hi Xiaoguang,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on bpf/master]
[also build test WARNING on axboe-block/for-next linus/master v6.2-rc8]
[cannot apply to bpf-next/master next-20230215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230215004122.28917-3-xiaoguang.wang%40linux.alibaba.com
patch subject: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
config: x86_64-randconfig-a002-20230213 (https://download.01.org/0day-ci/archive/20230215/202302152315.W5hlZK5P-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/0d433a1cf7f666e5596514da7fab92f2691463ad
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
        git checkout 0d433a1cf7f666e5596514da7fab92f2691463ad
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash net/unix/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302152315.W5hlZK5P-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from net/unix/scm.c:11:
>> include/linux/io_uring.h:107:5: warning: no previous prototype for function 'io_uring_submit_sqe' [-Wmissing-prototypes]
   int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
       ^
   include/linux/io_uring.h:107:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
   ^
   static 
   1 warning generated.


vim +/io_uring_submit_sqe +107 include/linux/io_uring.h

    56	
    57	static inline void io_uring_files_cancel(void)
    58	{
    59		if (current->io_uring) {
    60			io_uring_unreg_ringfd();
    61			__io_uring_cancel(false);
    62		}
    63	}
    64	static inline void io_uring_task_cancel(void)
    65	{
    66		if (current->io_uring)
    67			__io_uring_cancel(true);
    68	}
    69	static inline void io_uring_free(struct task_struct *tsk)
    70	{
    71		if (tsk->io_uring)
    72			__io_uring_free(tsk);
    73	}
    74	int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
    75				struct io_mapped_kbuf *kbuf);
    76	#else
    77	static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
    78				      struct iov_iter *iter, void *ioucmd)
    79	{
    80		return -EOPNOTSUPP;
    81	}
    82	static inline void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret,
    83			ssize_t ret2)
    84	{
    85	}
    86	static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd,
    87				void (*task_work_cb)(struct io_uring_cmd *))
    88	{
    89	}
    90	static inline struct sock *io_uring_get_socket(struct file *file)
    91	{
    92		return NULL;
    93	}
    94	static inline void io_uring_task_cancel(void)
    95	{
    96	}
    97	static inline void io_uring_files_cancel(void)
    98	{
    99	}
   100	static inline void io_uring_free(struct task_struct *tsk)
   101	{
   102	}
   103	static inline const char *io_uring_get_opcode(u8 opcode)
   104	{
   105		return "";
   106	}
 > 107	int io_uring_submit_sqe(int fd, const struct io_uring_sqe *sqe, u32 sqe_len,
   108				struct io_mapped_kbuf *kbuf)
   109	{
   110		return 0;
   111	}
   112	#endif
   113	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 1/3] bpf: add UBLK program type
  2023-02-15  0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
@ 2023-02-15 15:52   ` kernel test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-02-15 15:52 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: oe-kbuild-all

Hi Xiaoguang,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on bpf/master]
[also build test ERROR on axboe-block/for-next linus/master v6.2-rc8]
[cannot apply to bpf-next/master next-20230215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230215004122.28917-2-xiaoguang.wang%40linux.alibaba.com
patch subject: [RFC 1/3] bpf: add UBLK program type
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20230215/202302152356.3cUGk3rZ-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/7ab45e74c89ce799fc1df381389fa7364cd31466
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
        git checkout 7ab45e74c89ce799fc1df381389fa7364cd31466
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=m68k SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302152356.3cUGk3rZ-lkp@intel.com/

All errors (new ones prefixed by >>):

>> m68k-linux-ld: kernel/bpf/syscall.o:(.rodata+0x3a2): undefined reference to `bpf_ublk_prog_ops'
>> m68k-linux-ld: kernel/bpf/verifier.o:(.rodata+0x4d2): undefined reference to `bpf_ublk_verifier_ops'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/3] ublk_drv: add ebpf support
  2023-02-15  0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
@ 2023-02-16  8:11   ` Ming Lei
  2023-02-16 12:12     ` Xiaoguang Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2023-02-16  8:11 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang, ming.lei

On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
> Currently only one bpf_ublk_queue_sqe() ebpf helper is added; the ublksrv
> target can use this helper to write an ebpf prog to support ublk kernel &
> userspace zero copy, please see the ublksrv test code for more info.
> 
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>  drivers/block/ublk_drv.c       | 207 ++++++++++++++++++++++++++++++++-
>  include/uapi/linux/bpf.h       |   1 +
>  include/uapi/linux/ublk_cmd.h  |  11 ++
>  scripts/bpf_doc.py             |   4 +
>  tools/include/uapi/linux/bpf.h |   8 ++
>  5 files changed, 229 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index b628e9eaefa6..44c289b72864 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -61,6 +61,7 @@
>  struct ublk_rq_data {
>  	struct llist_node node;
>  	struct callback_head work;
> +	struct io_mapped_kbuf *kbuf;
>  };
>  
>  struct ublk_uring_cmd_pdu {
> @@ -163,6 +164,9 @@ struct ublk_device {
>  	unsigned int		nr_queues_ready;
>  	atomic_t		nr_aborted_queues;
>  
> +	struct bpf_prog		*io_prep_prog;
> +	struct bpf_prog		*io_submit_prog;
> +
>  	/*
>  	 * Our ubq->daemon may be killed without any notification, so
>  	 * monitor each queue's daemon periodically
> @@ -189,10 +193,46 @@ static DEFINE_MUTEX(ublk_ctl_mutex);
>  
>  static struct miscdevice ublk_misc;
>  
> +struct ublk_io_bpf_ctx {
> +	struct ublk_bpf_ctx ctx;
> +	struct ublk_device *ub;
> +	struct callback_head work;
> +};
> +
> +BPF_CALL_4(bpf_ublk_queue_sqe, struct ublk_io_bpf_ctx *, bpf_ctx,
> +	   struct io_uring_sqe *, sqe, u32, sqe_len, u32, fd)
> +{
> +	struct request *rq;
> +	struct ublk_rq_data *data;
> +	struct io_mapped_kbuf *kbuf;
> +	u16 q_id = bpf_ctx->ctx.q_id;
> +	u16 tag = bpf_ctx->ctx.tag;
> +
> +	rq = blk_mq_tag_to_rq(bpf_ctx->ub->tag_set.tags[q_id], tag);
> +	data = blk_mq_rq_to_pdu(rq);
> +	kbuf = data->kbuf;
> +	io_uring_submit_sqe(fd, sqe, sqe_len, kbuf);
> +	return 0;
> +}
> +
> +const struct bpf_func_proto ublk_bpf_queue_sqe_proto = {
> +	.func = bpf_ublk_queue_sqe,
> +	.gpl_only = false,
> +	.ret_type = RET_INTEGER,
> +	.arg1_type = ARG_ANYTHING,
> +	.arg2_type = ARG_ANYTHING,
> +	.arg3_type = ARG_ANYTHING,
> +	.arg4_type = ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  ublk_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  {
> -	return bpf_base_func_proto(func_id);
> +	switch (func_id) {
> +	case BPF_FUNC_ublk_queue_sqe:
> +		return &ublk_bpf_queue_sqe_proto;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
>  }
>  
>  static bool ublk_bpf_is_valid_access(int off, int size,
> @@ -200,6 +240,23 @@ static bool ublk_bpf_is_valid_access(int off, int size,
>  			const struct bpf_prog *prog,
>  			struct bpf_insn_access_aux *info)
>  {
> +	if (off < 0 || off >= sizeof(struct ublk_bpf_ctx))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +
> +	switch (off) {
> +	case offsetof(struct ublk_bpf_ctx, q_id):
> +		return size == sizeof_field(struct ublk_bpf_ctx, q_id);
> +	case offsetof(struct ublk_bpf_ctx, tag):
> +		return size == sizeof_field(struct ublk_bpf_ctx, tag);
> +	case offsetof(struct ublk_bpf_ctx, op):
> +		return size == sizeof_field(struct ublk_bpf_ctx, op);
> +	case offsetof(struct ublk_bpf_ctx, nr_sectors):
> +		return size == sizeof_field(struct ublk_bpf_ctx, nr_sectors);
> +	case offsetof(struct ublk_bpf_ctx, start_sector):
> +		return size == sizeof_field(struct ublk_bpf_ctx, start_sector);
> +	}
>  	return false;
>  }
>  
> @@ -324,7 +381,7 @@ static void ublk_put_device(struct ublk_device *ub)
>  static inline struct ublk_queue *ublk_get_queue(struct ublk_device *dev,
>  		int qid)
>  {
> -       return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
> +	return (struct ublk_queue *)&(dev->__queues[qid * dev->queue_size]);
>  }
>  
>  static inline bool ublk_rq_has_data(const struct request *rq)
> @@ -492,12 +549,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
>  static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
>  		struct ublk_io *io)
>  {
> +	struct ublk_device *ub = ubq->dev;
>  	const unsigned int rq_bytes = blk_rq_bytes(req);
>  	/*
>  	 * no zero copy, we delay copy WRITE request data into ublksrv
>  	 * context and the big benefit is that pinning pages in current
>  	 * context is pretty fast, see ublk_pin_user_pages
>  	 */
> +	if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
> +		return rq_bytes;

Can you explain a bit why READ isn't supported? WRITE zero copy is
supposed to be easy to support with a splice-based approach, and I am
actually more interested in READ zc.

> +
>  	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
>  		return rq_bytes;
>  
> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
>  	}
>  }
>  
> +static void ublk_bpf_io_submit_fn(struct callback_head *work)
> +{
> +	struct ublk_io_bpf_ctx *bpf_ctx = container_of(work,
> +			struct ublk_io_bpf_ctx, work);
> +
> +	if (bpf_ctx->ub->io_submit_prog)
> +		bpf_prog_run_pin_on_cpu(bpf_ctx->ub->io_submit_prog, bpf_ctx);
> +	kfree(bpf_ctx);
> +}
> +
> +static int ublk_init_uring_kbuf(struct request *rq)
> +{
> +	struct bio_vec *bvec;
> +	struct req_iterator rq_iter;
> +	struct bio_vec tmp;
> +	int nr_bvec = 0;
> +	struct io_mapped_kbuf *kbuf;
> +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +
> +	/* Drop previous allocation */
> +	if (data->kbuf) {
> +		kfree(data->kbuf->bvec);
> +		kfree(data->kbuf);
> +		data->kbuf = NULL;
> +	}
> +
> +	kbuf = kmalloc(sizeof(struct io_mapped_kbuf), GFP_NOIO);
> +	if (!kbuf)
> +		return -EIO;
> +
> +	rq_for_each_bvec(tmp, rq, rq_iter)
> +		nr_bvec++;
> +
> +	bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), GFP_NOIO);
> +	if (!bvec) {
> +		kfree(kbuf);
> +		return -EIO;
> +	}
> +	kbuf->bvec = bvec;
> +	rq_for_each_bvec(tmp, rq, rq_iter) {
> +		*bvec = tmp;
> +		bvec++;
> +	}
> +
> +	kbuf->count = blk_rq_bytes(rq);
> +	kbuf->nr_bvecs = nr_bvec;
> +	data->kbuf = kbuf;
> +	return 0;

bio/req bvec table is immutable, so here you can pass its reference
to kbuf directly.
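
A minimal sketch of that simplification, assuming a single unsplit bio
per request and the io_mapped_kbuf layout from patch 2 (the original's
previous-allocation cleanup is trimmed):

    static int ublk_init_uring_kbuf(struct request *rq)
    {
            struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
            struct io_mapped_kbuf *kbuf;

            kbuf = kmalloc(sizeof(*kbuf), GFP_NOIO);
            if (!kbuf)
                    return -ENOMEM;

            /* reference the request's immutable bvec table instead of copying */
            kbuf->bvec = rq->bio->bi_io_vec;
            kbuf->nr_bvecs = rq->bio->bi_vcnt;
            kbuf->count = blk_rq_bytes(rq);
            data->kbuf = kbuf;
            return 0;
    }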

> +}
> +
> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
> +{
> +	int err;
> +	struct ublk_device *ub = ubq->dev;
> +	struct bpf_prog *prog = ub->io_prep_prog;
> +	struct ublk_io_bpf_ctx *bpf_ctx;
> +
> +	if (!prog)
> +		return 0;
> +
> +	bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
> +	if (!bpf_ctx)
> +		return -EIO;
> +
> +	err = ublk_init_uring_kbuf(rq);
> +	if (err < 0) {
> +		kfree(bpf_ctx);
> +		return -EIO;
> +	}
> +	bpf_ctx->ub = ub;
> +	bpf_ctx->ctx.q_id = ubq->q_id;
> +	bpf_ctx->ctx.tag = rq->tag;
> +	bpf_ctx->ctx.op = req_op(rq);
> +	bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
> +	bpf_ctx->ctx.start_sector = blk_rq_pos(rq);

The above is for setting up the target io parameters, which are supposed
to come from userspace, since they are the result of user-space logic. If
these parameters come from the kernel, the whole logic has to be done
in io_prep_prog.

> +	bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
> +
> +	init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
> +	if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
> +		kfree(bpf_ctx);

task_work_add() is only available when ublk is built in.

> +	return 0;
> +}
> +
>  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		const struct blk_mq_queue_data *bd)
>  {
> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	if (unlikely(res != BLK_STS_OK))
>  		return BLK_STS_IOERR;
>  
> +	/* Currently just for test. */
> +	ublk_run_bpf_prog(ubq, rq);

Can you explain the above comment a bit? When is io_prep_prog called
in the non-test version? Or can you post the non-test version on the
list for review?

Here is the key to understanding the whole idea: when is io_prep_prog
finally called, and how are parameters passed to it?

Given it is an ebpf prog, I don't think any userspace parameter can be
passed to io_prep_prog when submitting IO; that means all user logic has
to be done inside io_prep_prog? If so, I am not sure it is a good way,
since an ebpf prog is a very limited programming environment, but the user
logic could be as complicated as using a btree to map io, or communicating
with a remote machine to figure out the mapping. Loop is just the
simplest direct mapping.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [UBLKSRV] Add ebpf support.
  2023-02-15  0:46 ` [UBLKSRV] Add " Xiaoguang Wang
@ 2023-02-16  8:28   ` Ming Lei
  2023-02-16  9:17     ` Xiaoguang Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2023-02-16  8:28 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang, ming.lei

On Wed, Feb 15, 2023 at 08:46:18AM +0800, Xiaoguang Wang wrote:
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>  bpf/ublk.bpf.c         | 168 +++++++++++++++++++++++++++++++++++++++++
>  include/ublk_cmd.h     |   2 +
>  include/ublksrv.h      |   8 ++
>  include/ublksrv_priv.h |   1 +
>  include/ublksrv_tgt.h  |   1 +
>  lib/ublksrv.c          |   4 +
>  lib/ublksrv_cmd.c      |  21 ++++++
>  tgt_loop.cpp           |  31 +++++++-
>  ublksrv_tgt.cpp        |  33 ++++++++
>  9 files changed, 268 insertions(+), 1 deletion(-)
>  create mode 100644 bpf/ublk.bpf.c
> 
> diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
> new file mode 100644
> index 0000000..80e79de
> --- /dev/null
> +++ b/bpf/ublk.bpf.c
> @@ -0,0 +1,168 @@
> +#include "vmlinux.h"

Where is vmlinux.h?

> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +
> +
> +static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
> +		u32 sqe_len, u32 fd) = (void *) 212;
> +
> +int target_fd = -1;
> +
> +struct sqe_key {
> +	u16 q_id;
> +	u16 tag;
> +	u32 res;
> +	u64 offset;
> +};
> +
> +struct sqe_data {
> +	char data[128];
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, 8192);
> +	__type(key, struct sqe_key);
> +	__type(value, struct sqe_data);
> +} sqes_map SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, 128);
> +	__type(key, int);
> +	__type(value, int);
> +} uring_fd_map SEC(".maps");
> +
> +static inline void io_uring_prep_rw(__u8 op, struct io_uring_sqe *sqe, int fd,
> +				    const void *addr, unsigned len,
> +				    __u64 offset)
> +{
> +	sqe->opcode = op;
> +	sqe->flags = 0;
> +	sqe->ioprio = 0;
> +	sqe->fd = fd;
> +	sqe->off = offset;
> +	sqe->addr = (unsigned long) addr;
> +	sqe->len = len;
> +	sqe->fsync_flags = 0;
> +	sqe->buf_index = 0;
> +	sqe->personality = 0;
> +	sqe->splice_fd_in = 0;
> +	sqe->addr3 = 0;
> +	sqe->__pad2[0] = 0;
> +}
> +
> +static inline void io_uring_prep_nop(struct io_uring_sqe *sqe)
> +{
> +	io_uring_prep_rw(IORING_OP_NOP, sqe, -1, 0, 0, 0);
> +}
> +
> +static inline void io_uring_prep_read(struct io_uring_sqe *sqe, int fd,
> +			void *buf, unsigned nbytes, off_t offset)
> +{
> +	io_uring_prep_rw(IORING_OP_READ, sqe, fd, buf, nbytes, offset);
> +}
> +
> +static inline void io_uring_prep_write(struct io_uring_sqe *sqe, int fd,
> +	const void *buf, unsigned nbytes, off_t offset)
> +{
> +	io_uring_prep_rw(IORING_OP_WRITE, sqe, fd, buf, nbytes, offset);
> +}
> +
> +/*
> +static u64 submit_sqe(struct bpf_map *map, void *key, void *value, void *data)
> +{
> +	struct io_uring_sqe *sqe = (struct io_uring_sqe *)value;
> +	struct ublk_bpf_ctx *ctx = ((struct callback_ctx *)data)->ctx;
> +	struct sqe_key *skey = (struct sqe_key *)key;
> +	char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
> +	char fmt2[] ="submit sqe test prep\n";
> +	u16 qid = skey->q_id, tag = skey->tag;
> +	int q_id = skey->q_id, *ring_fd;
> +
> +	bpf_trace_printk(fmt2, sizeof(fmt2));
> +	ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
> +	if (ring_fd) {
> +		bpf_trace_printk(fmt, sizeof(fmt), qid, tag);
> +		bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
> +		bpf_map_delete_elem(map, key);
> +	}
> +	return 0;
> +}
> +*/
> +
> +static inline __u64 build_user_data(unsigned tag, unsigned op,
> +			unsigned tgt_data, unsigned is_target_io,
> +			unsigned is_bpf_io)
> +{
> +	return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63 |
> +		(__u64)is_bpf_io << 60;
> +}
> +
> +SEC("ublk.s/")
> +int ublk_io_prep_prog(struct ublk_bpf_ctx *ctx)
> +{
> +	struct io_uring_sqe *sqe;
> +	struct sqe_data sd = {0};
> +	struct sqe_key key;
> +	u16 q_id = ctx->q_id;
> +	u8 op; // = ctx->op;
> +	u32 nr_sectors = ctx->nr_sectors;
> +	u64 start_sector = ctx->start_sector;
> +	char fmt_1[] ="ublk_io_prep_prog %d %d\n";
> +
> +	key.q_id = ctx->q_id;
> +	key.tag = ctx->tag;
> +	key.offset = 0;
> +	key.res = 0;
> +
> +	bpf_probe_read_kernel(&op, 1, &ctx->op);
> +	bpf_trace_printk(fmt_1, sizeof(fmt_1), q_id, op);
> +	sqe = (struct io_uring_sqe *)&sd;
> +	if (op == REQ_OP_READ) {
> +		char fmt[] = "add read sqe\n";
> +
> +		bpf_trace_printk(fmt, sizeof(fmt));
> +		io_uring_prep_read(sqe, target_fd, 0, nr_sectors << 9,
> +				   start_sector << 9);
> +		sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
> +		bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
> +	} else if (op == REQ_OP_WRITE) {
> +		char fmt[] = "add write sqe\n";
> +
> +		bpf_trace_printk(fmt, sizeof(fmt));
> +
> +		io_uring_prep_write(sqe, target_fd, 0, nr_sectors << 9,
> +				    start_sector << 9);
> +		sqe->user_data = build_user_data(ctx->tag, op, 0, 1, 1);
> +		bpf_map_update_elem(&sqes_map, &key, &sd, BPF_NOEXIST);
> +	} else {
> +		;
> +	}
> +	return 0;
> +}
> +
> +SEC("ublk.s/")
> +int ublk_io_submit_prog(struct ublk_bpf_ctx *ctx)
> +{
> +	struct io_uring_sqe *sqe;
> +	char fmt[] ="submit sqe for req[qid:%u tag:%u]\n";
> +	int q_id = ctx->q_id, *ring_fd;
> +	struct sqe_key key;
> +
> +	key.q_id = ctx->q_id;
> +	key.tag = ctx->tag;
> +	key.offset = 0;
> +	key.res = 0;
> +
> +	sqe = bpf_map_lookup_elem(&sqes_map, &key);
> +	ring_fd = bpf_map_lookup_elem(&uring_fd_map, &q_id);
> +	if (ring_fd) {
> +		bpf_trace_printk(fmt, sizeof(fmt), key.q_id, key.tag);
> +		bpf_ublk_queue_sqe(ctx, sqe, 128, *ring_fd);
> +		bpf_map_delete_elem(&sqes_map, &key);
> +	}
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "GPL";
> diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
> index f6238cc..893ba8c 100644
> --- a/include/ublk_cmd.h
> +++ b/include/ublk_cmd.h
> @@ -17,6 +17,8 @@
>  #define	UBLK_CMD_STOP_DEV	0x07
>  #define	UBLK_CMD_SET_PARAMS	0x08
>  #define	UBLK_CMD_GET_PARAMS	0x09
> +#define UBLK_CMD_REG_BPF_PROG		0x0a
> +#define UBLK_CMD_UNREG_BPF_PROG		0x0b
>  #define	UBLK_CMD_START_USER_RECOVERY	0x10
>  #define	UBLK_CMD_END_USER_RECOVERY	0x11
>  #define	UBLK_CMD_GET_DEV_INFO2		0x12
> diff --git a/include/ublksrv.h b/include/ublksrv.h
> index d38bd46..f5deddb 100644
> --- a/include/ublksrv.h
> +++ b/include/ublksrv.h
> @@ -106,6 +106,7 @@ struct ublksrv_tgt_info {
>  	unsigned int nr_fds;
>  	int fds[UBLKSRV_TGT_MAX_FDS];
>  	void *tgt_data;
> +	void *tgt_bpf_obj;
>  
>  	/*
>  	 * Extra IO slots for each queue, target code can reserve some
> @@ -263,6 +264,8 @@ struct ublksrv_tgt_type {
>  	int (*init_queue)(const struct ublksrv_queue *, void **queue_data_ptr);
>  	void (*deinit_queue)(const struct ublksrv_queue *);
>  
> +	int (*init_queue_bpf)(const struct ublksrv_dev *dev, const struct ublksrv_queue *q);
> +
>  	unsigned long reserved[5];
>  };
>  
> @@ -318,6 +321,11 @@ extern void ublksrv_ctrl_prep_recovery(struct ublksrv_ctrl_dev *dev,
>  		const char *recovery_jbuf);
>  extern const char *ublksrv_ctrl_get_recovery_jbuf(const struct ublksrv_ctrl_dev *dev);
>  
> +extern void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,
> +					  void *obj);
> +extern int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
> +				     int io_prep_fd, int io_submit_fd);
> +
>  /* ublksrv device ("/dev/ublkcN") level APIs */
>  extern const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *
>  		ctrl_dev);
> diff --git a/include/ublksrv_priv.h b/include/ublksrv_priv.h
> index 2996baa..8da8866 100644
> --- a/include/ublksrv_priv.h
> +++ b/include/ublksrv_priv.h
> @@ -42,6 +42,7 @@ struct ublksrv_ctrl_dev {
>  
>  	const char *tgt_type;
>  	const struct ublksrv_tgt_type *tgt_ops;
> +	void *bpf_obj;
>  
>  	/*
>  	 * default is UBLKSRV_RUN_DIR but can be specified via command line,
> diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
> index 234d31e..e0db7d9 100644
> --- a/include/ublksrv_tgt.h
> +++ b/include/ublksrv_tgt.h
> @@ -9,6 +9,7 @@
>  #include <getopt.h>
>  #include <string.h>
>  #include <stdarg.h>
> +#include <limits.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
>  #include <sys/ioctl.h>
> diff --git a/lib/ublksrv.c b/lib/ublksrv.c
> index 96bed95..110ccb3 100644
> --- a/lib/ublksrv.c
> +++ b/lib/ublksrv.c
> @@ -603,6 +603,9 @@ skip_alloc_buf:
>  		goto fail;
>  	}
>  
> +	if (dev->tgt.ops->init_queue_bpf)
> +		dev->tgt.ops->init_queue_bpf(tdev, local_to_tq(q));
> +
>  	ublksrv_dev_init_io_cmds(dev, q);
>  
>  	/*
> @@ -723,6 +726,7 @@ const struct ublksrv_dev *ublksrv_dev_init(const struct ublksrv_ctrl_dev *ctrl_d
>  	}
>  
>  	tgt->fds[0] = dev->cdev_fd;
> +	tgt->tgt_bpf_obj = ctrl_dev->bpf_obj;
>  
>  	ret = ublksrv_tgt_init(dev, ctrl_dev->tgt_type, ctrl_dev->tgt_ops,
>  			ctrl_dev->tgt_argc, ctrl_dev->tgt_argv);
> diff --git a/lib/ublksrv_cmd.c b/lib/ublksrv_cmd.c
> index 0d7265d..0101cb9 100644
> --- a/lib/ublksrv_cmd.c
> +++ b/lib/ublksrv_cmd.c
> @@ -502,6 +502,27 @@ int ublksrv_ctrl_end_recovery(struct ublksrv_ctrl_dev *dev, int daemon_pid)
>  	return ret;
>  }
>  
> +int ublksrv_ctrl_reg_bpf_prog(struct ublksrv_ctrl_dev *dev,
> +			      int io_prep_fd, int io_submit_fd)
> +{
> +	struct ublksrv_ctrl_cmd_data data = {
> +		.cmd_op = UBLK_CMD_REG_BPF_PROG,
> +		.flags = CTRL_CMD_HAS_DATA,
> +	};
> +	int ret;
> +
> +	data.data[0] = io_prep_fd;
> +	data.data[1] = io_submit_fd;
> +
> +	ret = __ublksrv_ctrl_cmd(dev, &data);
> +	return ret;
> +}
> +
> +void ublksrv_ctrl_set_bpf_obj_info(struct ublksrv_ctrl_dev *dev,  void *obj)
> +{
> +	dev->bpf_obj = obj;
> +}
> +
>  const struct ublksrv_ctrl_dev_info *ublksrv_ctrl_get_dev_info(
>  		const struct ublksrv_ctrl_dev *dev)
>  {
> diff --git a/tgt_loop.cpp b/tgt_loop.cpp
> index 79a65d3..b1568fe 100644
> --- a/tgt_loop.cpp
> +++ b/tgt_loop.cpp
> @@ -4,7 +4,11 @@
>  
>  #include <poll.h>
>  #include <sys/epoll.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
>  #include "ublksrv_tgt.h"
> +#include "bpf/.tmp/ublk.skel.h"

Where is bpf/.tmp/ublk.skel.h?

>  
>  static bool backing_supports_discard(char *name)
>  {
> @@ -88,6 +92,20 @@ static int loop_recovery_tgt(struct ublksrv_dev *dev, int type)
>  	return 0;
>  }
>  
> +static int loop_init_queue_bpf(const struct ublksrv_dev *dev,
> +			       const struct ublksrv_queue *q)
> +{
> +	int ret, q_id, ring_fd;
> +	const struct ublksrv_tgt_info *tgt = &dev->tgt;
> +	struct ublk_bpf *obj = (struct ublk_bpf*)tgt->tgt_bpf_obj;
> +
> +	q_id = q->q_id;
> +	ring_fd = q->ring_ptr->ring_fd;
> +	ret = bpf_map_update_elem(bpf_map__fd(obj->maps.uring_fd_map), &q_id,
> +				  &ring_fd,  0);
> +	return ret;
> +}
> +
>  static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
>  		*argv[])
>  {
> @@ -125,6 +143,7 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
>  		},
>  	};
>  	bool can_discard = false;
> +	struct ublk_bpf *bpf_obj;
>  
>  	strcpy(tgt_json.name, "loop");
>  
> @@ -218,6 +237,10 @@ static int loop_init_tgt(struct ublksrv_dev *dev, int type, int argc, char
>  			jbuf = ublksrv_tgt_realloc_json_buf(dev, &jbuf_size);
>  	} while (ret < 0);
>  
> +	if (tgt->tgt_bpf_obj) {
> +		bpf_obj = (struct ublk_bpf *)tgt->tgt_bpf_obj;
> +		bpf_obj->data->target_fd = tgt->fds[1];
> +	}
>  	return 0;
>  }
>  
> @@ -252,9 +275,14 @@ static int loop_queue_tgt_io(const struct ublksrv_queue *q,
>  		const struct ublk_io_data *data, int tag)
>  {
>  	const struct ublksrv_io_desc *iod = data->iod;
> -	struct io_uring_sqe *sqe = io_uring_get_sqe(q->ring_ptr);
> +	struct io_uring_sqe *sqe;
>  	unsigned ublk_op = ublksrv_get_op(iod);
>  
> +	/* ebpf prog will handle read/write requests. */
> +	if ((ublk_op == UBLK_IO_OP_READ) || (ublk_op == UBLK_IO_OP_WRITE))
> +		return 1;
> +
> +	sqe = io_uring_get_sqe(q->ring_ptr);
>  	if (!sqe)
>  		return 0;
>  
> @@ -374,6 +402,7 @@ struct ublksrv_tgt_type  loop_tgt_type = {
>  	.type	= UBLKSRV_TGT_TYPE_LOOP,
>  	.name	=  "loop",
>  	.recovery_tgt = loop_recovery_tgt,
> +	.init_queue_bpf = loop_init_queue_bpf,
>  };
>  
>  static void tgt_loop_init() __attribute__((constructor));
> diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
> index 5ed328d..d3796cf 100644
> --- a/ublksrv_tgt.cpp
> +++ b/ublksrv_tgt.cpp
> @@ -2,6 +2,7 @@
>  
>  #include "config.h"
>  #include "ublksrv_tgt.h"
> +#include "bpf/.tmp/ublk.skel.h"

Same with above


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [UBLKSRV] Add ebpf support.
  2023-02-16  8:28   ` Ming Lei
@ 2023-02-16  9:17     ` Xiaoguang Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-16  9:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang

hello,

> On Wed, Feb 15, 2023 at 08:46:18AM +0800, Xiaoguang Wang wrote:
>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>> ---
>>  bpf/ublk.bpf.c         | 168 +++++++++++++++++++++++++++++++++++++++++
>>  include/ublk_cmd.h     |   2 +
>>  include/ublksrv.h      |   8 ++
>>  include/ublksrv_priv.h |   1 +
>>  include/ublksrv_tgt.h  |   1 +
>>  lib/ublksrv.c          |   4 +
>>  lib/ublksrv_cmd.c      |  21 ++++++
>>  tgt_loop.cpp           |  31 +++++++-
>>  ublksrv_tgt.cpp        |  33 ++++++++
>>  9 files changed, 268 insertions(+), 1 deletion(-)
>>  create mode 100644 bpf/ublk.bpf.c
>>
>> diff --git a/bpf/ublk.bpf.c b/bpf/ublk.bpf.c
>> new file mode 100644
>> index 0000000..80e79de
>> --- /dev/null
>> +++ b/bpf/ublk.bpf.c
>> @@ -0,0 +1,168 @@
>> +#include "vmlinux.h"
> Where is vmlinux.h?
Sorry, I forgot to include the Makefile in this commit; it shows
how to generate vmlinux.h and how to compile the ebpf prog
object. I'll prepare a v2 patch set to fix this issue soon.
Thanks for the review.
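
For reference, vmlinux.h is typically generated from the running
kernel's BTF with bpftool; this is the common recipe, though not
necessarily the exact rule in the missing Makefile:

    bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h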

Regards,
Xiaoguang Wang

>
> [... rest of quoted patch snipped ...]
>
> Thanks,
> Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/3] ublk_drv: add ebpf support
  2023-02-16  8:11   ` Ming Lei
@ 2023-02-16 12:12     ` Xiaoguang Wang
  2023-02-17  3:02       ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-16 12:12 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang

hello,

> On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
>> Currently only one bpf_ublk_queue_sqe() ebpf helper is added; the ublksrv
>> target can use this helper to write an ebpf prog to support ublk kernel &
>> userspace zero copy, please see the ublksrv test code for more info.
>>
>>  	 */
>> +	if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
>> +		return rq_bytes;
> Can you explain a bit why READ isn't supported? WRITE zero copy is
> supposed to be easy to support with a splice-based approach, and I am
> actually more interested in READ zc.
No special reason, READ op can also be supported. I'll
add this support in patch set v2.
For this RFC patch set, I just tried to show the idea, so
I must admit that the current code is not mature enough :)

>
>> +
>>  	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
>>  		return rq_bytes;
>>  
>> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
>>  	}
>>  }
>>  
>>
>> +	kbuf->bvec = bvec;
>> +	rq_for_each_bvec(tmp, rq, rq_iter) {
>> +		*bvec = tmp;
>> +		bvec++;
>> +	}
>> +
>> +	kbuf->count = blk_rq_bytes(rq);
>> +	kbuf->nr_bvecs = nr_bvec;
>> +	data->kbuf = kbuf;
>> +	return 0;
> bio/req bvec table is immutable, so here you can pass its reference
> to kbuf directly.
Yeah, thanks.

>
>> +}
>> +
>> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
>> +{
>> +	int err;
>> +	struct ublk_device *ub = ubq->dev;
>> +	struct bpf_prog *prog = ub->io_prep_prog;
>> +	struct ublk_io_bpf_ctx *bpf_ctx;
>> +
>> +	if (!prog)
>> +		return 0;
>> +
>> +	bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
>> +	if (!bpf_ctx)
>> +		return -EIO;
>> +
>> +	err = ublk_init_uring_kbuf(rq);
>> +	if (err < 0) {
>> +		kfree(bpf_ctx);
>> +		return -EIO;
>> +	}
>> +	bpf_ctx->ub = ub;
>> +	bpf_ctx->ctx.q_id = ubq->q_id;
>> +	bpf_ctx->ctx.tag = rq->tag;
>> +	bpf_ctx->ctx.op = req_op(rq);
>> +	bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
>> +	bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
> The above is for setting up the target io parameters, which are supposed
> to come from userspace, since they are the result of user-space logic. If
> these parameters come from the kernel, the whole logic has to be done
> in io_prep_prog.
Yeah, io_prep_prog is designed to implement the user-space
io logic.

>
>> +	bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
>> +
>> +	init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
>> +	if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
>> +		kfree(bpf_ctx);
> task_work_add() is only available when ublk is built in.
Yeah, I'm thinking about how to work around it.

>
>> +	return 0;
>> +}
>> +
>>  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  		const struct blk_mq_queue_data *bd)
>>  {
>> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  	if (unlikely(res != BLK_STS_OK))
>>  		return BLK_STS_IOERR;
>>  
>> +	/* Currently just for test. */
>> +	ublk_run_bpf_prog(ubq, rq);
> Can you explain the above comment a bit? When is io_prep_prog called
> in the non-test version? Or can you post the non-test version on the
> list for review?
Forgot to delete stale comments, sorry. I'm writing the v2 patch set now.

> Here is the key to understanding the whole idea: when is io_prep_prog
> finally called, and how are parameters passed to it?
Let me explain more about the design.
io_prep_prog has two types of parameters:
1) its call argument, struct ublk_bpf_ctx (see ublk.bpf.c).
ublk_bpf_ctx describes one kernel io request: its op, qid,
and sector info. io_prep_prog uses this info to map the
target io.
2) an ebpf map structure: the user-space daemon can use the
map structure to pass information from user space to
io_prep_prog, which helps it initialize the target io if necessary.

io_prep_prog is called when ublk_queue_rq() is called. This bpf
prog initializes one or more sqes according to the user logic,
puts these sqes in an ebpf map structure, and then executes
task_work_add() to notify the ubq_daemon to run io_submit_prog.
Note that we cannot call io_uring_submit_sqe() in the task
context that calls ublk_queue_rq(), because that context does
not own the ubq_daemon's io_uring instance.
Later the ubq_daemon calls io_submit_prog to submit the sqes.
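
To make this concrete, a rough sketch of a loop-target io_prep_prog is
below. The ctx layout, map name, section name and helper id are only
illustrative here, not a final ABI (the real definitions live in
ublk.bpf.c and the RFC patches):

/* ublk.bpf.c-style sketch; layout, section name and helper id are
 * illustrative only.
 */
#include <linux/types.h>
#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

struct ublk_bpf_ctx {
	__u32 q_id;
	__u32 tag;
	__u32 op;
	__u32 nr_sectors;
	__u64 start_sector;
};

/* hypothetical helper id for bpf_ublk_queue_sqe() */
static long (*bpf_ublk_queue_sqe)(void *ctx, struct io_uring_sqe *sqe,
				  __u32 sqe_len, __u32 fd) = (void *)212;

/* the daemon stores the backend file fd here before starting the device */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, int);
} backend_fd_map SEC(".maps");

SEC("ublk")	/* hypothetical section name for BPF_PROG_TYPE_UBLK */
int io_prep_prog(struct ublk_bpf_ctx *ctx)
{
	struct io_uring_sqe sqe = {};
	int key = 0, *fd;

	fd = bpf_map_lookup_elem(&backend_fd_map, &key);
	if (!fd)
		return 0;

	/* loop target: map the ublk request 1:1 onto the backing file */
	sqe.opcode = ctx->op == 0 /* REQ_OP_READ */ ?
			IORING_OP_READ : IORING_OP_WRITE;
	sqe.off = ctx->start_sector << 9;
	sqe.len = ctx->nr_sectors << 9;
	/* sqe.addr stays unset: the buffer comes from the kernel request */

	bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), *fd);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";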

>
> Given it is an ebpf prog, I don't think any userspace parameter can be
> passed to io_prep_prog when submitting IO; that means all user logic has
> to be done inside io_prep_prog? If yes, I am not sure it is a good way,
> because an ebpf prog is a very limited programming environment, but the
> user logic could be as complicated as using a btree to map io, or
> communicating with a remote machine to figure out the mapping. Loop is
> just the simplest direct mapping.
Currently, we can use an ebpf map structure to pass user-space
parameters to io_prep_prog. I also agree with you that complicated
logic may be hard to implement in an ebpf prog; I hope the ebpf
community will improve this situation gradually.
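
For example, on the daemon side, passing a backend fd through such a
map could look roughly like the libbpf sketch below ("backend_fd_map"
is an assumed name matching the prog sketch earlier, not part of the
RFC):

/* userspace daemon sketch, assuming libbpf and a hypothetical
 * "backend_fd_map" array map read by io_prep_prog.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>

static int publish_backend_fd(struct bpf_object *obj, const char *path)
{
	int key = 0;
	int backend_fd = open(path, O_RDWR | O_DIRECT);
	struct bpf_map *map;

	if (backend_fd < 0)
		return -1;

	map = bpf_object__find_map_by_name(obj, "backend_fd_map");
	if (!map)
		return -1;

	/* io_prep_prog reads this fd when mapping ublk requests */
	return bpf_map__update_elem(map, &key, sizeof(key),
				    &backend_fd, sizeof(backend_fd), BPF_ANY);
}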

The userspace target implementations I have met so far just use
userspace block device solutions to access a distributed filesystem;
they involve socket programming and have simple io mapping logic.
We can prepare the socket fds in an ebpf map structure, and this io
mapping logic should be easy to implement in an ebpf prog, though
I have not applied this ebpf method to our internal workloads yet.

Thanks for review.

Regards,
Xiaoguang Wang

>
>
> Thanks, 
> Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/3] ublk_drv: add ebpf support
  2023-02-16 12:12     ` Xiaoguang Wang
@ 2023-02-17  3:02       ` Ming Lei
  2023-02-17 10:46         ` Ming Lei
  2023-02-22 14:13         ` Xiaoguang Wang
  0 siblings, 2 replies; 18+ messages in thread
From: Ming Lei @ 2023-02-17  3:02 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang, ming.lei

On Thu, Feb 16, 2023 at 08:12:18PM +0800, Xiaoguang Wang wrote:
> hello,
> 
> > On Wed, Feb 15, 2023 at 08:41:22AM +0800, Xiaoguang Wang wrote:
> >> Currently only one ebpf helper, bpf_ublk_queue_sqe(), is added; a
> >> ublksrv target can use this helper to write an ebpf prog to support
> >> ublk kernel & userspace zero copy. Please see the ublksrv test code
> >> for more info.
> >>
> >>  	 */
> >> +	if ((req_op(req) == REQ_OP_WRITE) && ub->io_prep_prog)
> >> +		return rq_bytes;
> > Can you explain a bit why READ isn't supported? Because WRITE zero
> > copy is supposed to be supported easily with a splice-based approach,
> > and I am more interested in READ zc actually.
> No special reason, the READ op can also be supported; I'll
> add that support in patch set v2.
> For this RFC patch set I just tried to show the idea, so
> I must admit the current code is not mature enough :)

OK.

> 
> >
> >> +
> >>  	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
> >>  		return rq_bytes;
> >>  
> >> @@ -860,6 +921,89 @@ static void ublk_queue_cmd(struct ublk_queue *ubq, struct request *rq)
> >>  	}
> >>  }
> >>  
> >>
> >> +	kbuf->bvec = bvec;
> >> +	rq_for_each_bvec(tmp, rq, rq_iter) {
> >> +		*bvec = tmp;
> >> +		bvec++;
> >> +	}
> >> +
> >> +	kbuf->count = blk_rq_bytes(rq);
> >> +	kbuf->nr_bvecs = nr_bvec;
> >> +	data->kbuf = kbuf;
> >> +	return 0;
> > The bio/req bvec table is immutable, so here you can pass a reference
> > to it to kbuf directly.
> Yeah, thanks.

Also, if the request has multiple bios, you either need to submit
multiple sqes or copy all bvecs into a single table. In the case of a
single bio, the table reference can be used directly.
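
Something like this (untested) sketch for the single-bio fast path,
with the kbuf struct name and fields assumed from your patch:

/* untested sketch: reference the request's immutable bvec table for
 * the single-bio case instead of copying every bvec; io_mapped_kbuf
 * is an assumed type name, its fields follow the RFC patch.
 */
static int ublk_init_uring_kbuf(struct request *rq,
				struct io_mapped_kbuf *kbuf)
{
	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);

	/* multi-bio requests still need multiple sqes or a copied table */
	if (rq->bio != rq->biotail)
		return -EINVAL;

	kbuf->bvec = rq->bio->bi_io_vec;
	kbuf->nr_bvecs = rq->bio->bi_vcnt;
	kbuf->count = blk_rq_bytes(rq);
	data->kbuf = kbuf;
	return 0;
}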

> 
> >
> >> +}
> >> +
> >> +static int ublk_run_bpf_prog(struct ublk_queue *ubq, struct request *rq)
> >> +{
> >> +	int err;
> >> +	struct ublk_device *ub = ubq->dev;
> >> +	struct bpf_prog *prog = ub->io_prep_prog;
> >> +	struct ublk_io_bpf_ctx *bpf_ctx;
> >> +
> >> +	if (!prog)
> >> +		return 0;
> >> +
> >> +	bpf_ctx = kmalloc(sizeof(struct ublk_io_bpf_ctx), GFP_NOIO);
> >> +	if (!bpf_ctx)
> >> +		return -EIO;
> >> +
> >> +	err = ublk_init_uring_kbuf(rq);
> >> +	if (err < 0) {
> >> +		kfree(bpf_ctx);
> >> +		return -EIO;
> >> +	}
> >> +	bpf_ctx->ub = ub;
> >> +	bpf_ctx->ctx.q_id = ubq->q_id;
> >> +	bpf_ctx->ctx.tag = rq->tag;
> >> +	bpf_ctx->ctx.op = req_op(rq);
> >> +	bpf_ctx->ctx.nr_sectors = blk_rq_sectors(rq);
> >> +	bpf_ctx->ctx.start_sector = blk_rq_pos(rq);
> > The above is for setting up the target io parameters, which are supposed
> > to come from userspace, because they are the result of user-space logic.
> > If these parameters come from the kernel, the whole logic has to be done
> > in io_prep_prog.
> Yeah, io_prep_prog is designed to implement the user-space
> io logic.

That could be the biggest weakness of this approach: people really
want to implement complicated logic in userspace, which should be
the biggest value of ublk, but now it seems you move kernel C
programming into ebpf programming, and I don't think ebpf is good
at handling complicated userspace logic.

> 
> >
> >> +	bpf_prog_run_pin_on_cpu(prog, bpf_ctx);
> >> +
> >> +	init_task_work(&bpf_ctx->work, ublk_bpf_io_submit_fn);
> >> +	if (task_work_add(ubq->ubq_daemon, &bpf_ctx->work, TWA_SIGNAL_NO_IPI))
> >> +		kfree(bpf_ctx);
> > task_work_add() is only available in case of ublk builtin.
> Yeah, I'm thinking about how to work around it.
> 
> >
> >> +	return 0;
> >> +}
> >> +
> >>  static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>  		const struct blk_mq_queue_data *bd)
> >>  {
> >> @@ -872,6 +1016,9 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>  	if (unlikely(res != BLK_STS_OK))
> >>  		return BLK_STS_IOERR;
> >>  
> >> +	/* Currently just for test. */
> >> +	ublk_run_bpf_prog(ubq, rq);
> > Can you explain the above comment a bit? When is the io_prep_prog called
> > in the non-test version? Or can you post the non-test version on the list
> > for review.
> Sorry, that's a stale comment I forgot to delete. I'm working on the v2 patch set now.

OK, got it, so it looks like ublk_run_bpf_prog is designed to run two
progs loaded from two control commands.

> 
> > Here it is the key for understanding the whole idea, especially when
> > is io_prep_prog called finally? How to pass parameters to io_prep_prog?
> Let me explain more about the design.
> io_prep_prog has two types of parameters:
> 1) its call argument, struct ublk_bpf_ctx (see ublk.bpf.c).
> ublk_bpf_ctx describes one kernel io request: its op, qid,
> and sector info. io_prep_prog uses this info to map the
> target io.
> 2) an ebpf map structure: the user-space daemon can use the
> map structure to pass information from user space to
> io_prep_prog, which helps it initialize the target io if necessary.
>
> io_prep_prog is called when ublk_queue_rq() is called. This bpf
> prog initializes one or more sqes according to the user logic,
> puts these sqes in an ebpf map structure, and then executes
> task_work_add() to notify the ubq_daemon to run io_submit_prog.
> Note that we cannot call io_uring_submit_sqe() in the task
> context that calls ublk_queue_rq(), because that context does
> not own the ubq_daemon's io_uring instance.
> Later the ubq_daemon calls io_submit_prog to submit the sqes.

Submitting sqes from the kernel looks interesting, but I guess
performance may be hurt, given that plugging (batching) can't be
applied any more, which is expected to affect io perf a lot.
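
For comparison, the daemon-side path batches naturally: it can prepare
one sqe per ublk request and ring the doorbell once for the whole
batch. A rough liburing illustration (not from the patches):

/* one io_uring_submit() covers the whole batch of prepared sqes */
#include <liburing.h>

static int submit_batch(struct io_uring *ring, int backend_fd,
			struct iovec *iovs, off_t *offs, unsigned nr)
{
	unsigned i;

	for (i = 0; i < nr; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			break;
		io_uring_prep_writev(sqe, backend_fd, &iovs[i], 1, offs[i]);
	}
	/* a single syscall submits everything queued so far */
	return io_uring_submit(ring);
}

Per-request submission from ublk_queue_rq() would lose this batching.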



Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/3] ublk_drv: add ebpf support
  2023-02-17  3:02       ` Ming Lei
@ 2023-02-17 10:46         ` Ming Lei
  2023-02-22 14:13         ` Xiaoguang Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2023-02-17 10:46 UTC (permalink / raw)
  To: Xiaoguang Wang
  Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang, ming.lei

On Fri, Feb 17, 2023 at 11:02:14AM +0800, Ming Lei wrote:
> On Thu, Feb 16, 2023 at 08:12:18PM +0800, Xiaoguang Wang wrote:
> > hello,

...

> > io_prep_prog is called when ublk_queue_rq() is called. This bpf
> > prog initializes one or more sqes according to the user logic,
> > puts these sqes in an ebpf map structure, and then executes
> > task_work_add() to notify the ubq_daemon to run io_submit_prog.
> > Note that we cannot call io_uring_submit_sqe() in the task
> > context that calls ublk_queue_rq(), because that context does
> > not own the ubq_daemon's io_uring instance.
> > Later the ubq_daemon calls io_submit_prog to submit the sqes.
> 
> Submitting sqes from the kernel looks interesting, but I guess
> performance may be hurt, given that plugging (batching) can't be
> applied any more, which is expected to affect io perf a lot.

If submitting SQEs in the kernel is really doable, maybe we can add
another command, such as UBLK_IO_SUBMIT_SQE (just like
UBLK_IO_NEED_GET_DATA), pass the built SQE (which represents part of
the user logic's result) as the io_uring command payload, ask the ublk
driver to build the buffer for this SQE, and then submit the SQE in
the kernel.

But there is an SQE ordering problem: net usually requires SQEs to be
linked and submitted in order, and with this approach it becomes hard
to maintain the SQE order (some are linked in userspace, and some in
the kernel).
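
For example, with liburing the order is expressed by linking sqes in
one submission stream, roughly like below (illustration only, not from
the patches); an SQE built and submitted in the kernel could no longer
take part in such a user-side link chain:

/* send must complete before recv is issued, so the two are linked */
#include <liburing.h>

static int send_then_recv(struct io_uring *ring, int sock,
			  void *out, unsigned out_len,
			  void *in, unsigned in_len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EAGAIN;
	io_uring_prep_send(sqe, sock, out, out_len, 0);
	sqe->flags |= IOSQE_IO_LINK;	/* recv runs only after send */

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -EAGAIN;
	io_uring_prep_recv(sqe, sock, in, in_len, 0);

	/* both enter the ring in order, as one linked chain */
	return io_uring_submit(ring);
}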

Thanks,
Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 3/3] ublk_drv: add ebpf support
  2023-02-17  3:02       ` Ming Lei
  2023-02-17 10:46         ` Ming Lei
@ 2023-02-22 14:13         ` Xiaoguang Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Xiaoguang Wang @ 2023-02-22 14:13 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, io-uring, bpf, axboe, asml.silence, ZiyangZhang

hi,

I spent some time writing v2, especially thinking about how to work
around task_work_add() not being exported, so sorry for the late response.
>
>>> The above is for setting up the target io parameters, which are supposed
>>> to come from userspace, because they are the result of user-space logic.
>>> If these parameters come from the kernel, the whole logic has to be done
>>> in io_prep_prog.
>> Yeah, io_prep_prog is designed to implement the user-space
>> io logic.
> That could be the biggest weakness of this approach, because people
> really want to implement complicated logic in userspace, which should
> be the biggest value of ublk, but now seems you move kernel C
> programming into ebpf userspace programming, I don't think ebpf
> is good at handling complicated userspace logic.
Absolutely agree with you. ebpf has strict programming rules, and I
spent more time than I had initially expected to support the loop
target ebpf prog (ublk.bpf.c). Later I'll try to collaborate with my
colleagues to see whether we can program their userspace logic into
an ebpf prog, at least partially.
 
>> io_prep_prog is called when ublk_queue_rq() is called. This bpf
>> prog initializes one or more sqes according to the user logic,
>> puts these sqes in an ebpf map structure, and then executes
>> task_work_add() to notify the ubq_daemon to run io_submit_prog.
>> Note that we cannot call io_uring_submit_sqe() in the task
>> context that calls ublk_queue_rq(), because that context does
>> not own the ubq_daemon's io_uring instance.
>> Later the ubq_daemon calls io_submit_prog to submit the sqes.
> Submitting sqes from the kernel looks interesting, but I guess
> performance may be hurt, given that plugging (batching) can't be
> applied any more, which is expected to affect io perf a lot.
Yes, agreed, but I haven't had much time to improve this yet.
Currently I mainly try to use this feature for large ios, to
reduce the memory copy overhead, which consumes a lot of cpu
resources; our clients really hope we can reduce it.

Regards,
Xiaoguang Wang

>
>
>
> Thanks,
> Ming


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
@ 2023-02-15 18:57 kernel test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-02-15 18:57 UTC (permalink / raw)
  To: oe-kbuild; +Cc: lkp

:::::: 
:::::: Manual check reason: "low confidence static check warning: include/linux/io_uring_types.h:472:35: sparse: sparse: bad constant expression"
:::::: 

BCC: lkp@intel.com
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <20230215004122.28917-3-xiaoguang.wang@linux.alibaba.com>
References: <20230215004122.28917-3-xiaoguang.wang@linux.alibaba.com>
TO: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

Hi Xiaoguang,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on bpf/master]
[also build test WARNING on axboe-block/for-next linus/master v6.2-rc8]
[cannot apply to bpf-next/master next-20230215]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
patch link:    https://lore.kernel.org/r/20230215004122.28917-3-xiaoguang.wang%40linux.alibaba.com
patch subject: [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel
:::::: branch date: 18 hours ago
:::::: commit date: 18 hours ago
config: sh-randconfig-s053-20230212 (https://download.01.org/0day-ci/archive/20230216/202302160254.XFXTSlCj-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/0d433a1cf7f666e5596514da7fab92f2691463ad
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xiaoguang-Wang/bpf-add-UBLK-program-type/20230215-084242
        git checkout 0d433a1cf7f666e5596514da7fab92f2691463ad
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sh olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=sh SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/r/202302160254.XFXTSlCj-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   io_uring/xattr.c: note: in included file (through io_uring/io_uring.h):
   io_uring/slist.h:138:29: sparse: sparse: no newline at end of file
   io_uring/xattr.c: note: in included file (through io_uring/io_uring.h):
>> include/linux/io_uring_types.h:472:35: sparse: sparse: bad constant expression
--
   [the same pair of warnings repeats for every other file that includes
   io_uring/io_uring.h: io_uring.c, nop.c, fs.c, splice.c, advise.c, sync.c,
   filetable.c, openclose.c, uring_cmd.c, statx.c, epoll.c, msg_ring.c,
   net.c, timeout.c, sqpoll.c, fdinfo.c, tctx.c, poll.c, kbuf.c, cancel.c,
   opdef.c, rsrc.c, notif.c, io-wq.c, and kernel/sched/core.c]
--
   kernel/sched/core.c additionally reports pre-existing sparse warnings:
   kernel/sched/core.c:7024:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/sched/core.c:7024:17: sparse:    struct task_struct *
   kernel/sched/core.c:7024:17: sparse:    struct task_struct [noderef] __rcu *
   [similar task_struct __rcu comparison warnings repeat for
   kernel/sched/core.c:7240, kernel/sched/sched.h:2068, and sched.h:2226]

vim +472 include/linux/io_uring_types.h

e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  406  
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  407  enum {
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  408  	/* ctx owns file */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  409  	REQ_F_FIXED_FILE	= BIT(REQ_F_FIXED_FILE_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  410  	/* drain existing IO first */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  411  	REQ_F_IO_DRAIN		= BIT(REQ_F_IO_DRAIN_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  412  	/* linked sqes */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  413  	REQ_F_LINK		= BIT(REQ_F_LINK_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  414  	/* doesn't sever on completion < 0 */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  415  	REQ_F_HARDLINK		= BIT(REQ_F_HARDLINK_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  416  	/* IOSQE_ASYNC */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  417  	REQ_F_FORCE_ASYNC	= BIT(REQ_F_FORCE_ASYNC_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  418  	/* IOSQE_BUFFER_SELECT */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  419  	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  420  	/* IOSQE_CQE_SKIP_SUCCESS */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  421  	REQ_F_CQE_SKIP		= BIT(REQ_F_CQE_SKIP_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  422  
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  423  	/* fail rest of links */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  424  	REQ_F_FAIL		= BIT(REQ_F_FAIL_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  425  	/* on inflight list, should be cancelled and waited on exit reliably */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  426  	REQ_F_INFLIGHT		= BIT(REQ_F_INFLIGHT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  427  	/* read/write uses file position */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  428  	REQ_F_CUR_POS		= BIT(REQ_F_CUR_POS_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  429  	/* must not punt to workers */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  430  	REQ_F_NOWAIT		= BIT(REQ_F_NOWAIT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  431  	/* has or had linked timeout */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  432  	REQ_F_LINK_TIMEOUT	= BIT(REQ_F_LINK_TIMEOUT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  433  	/* needs cleanup */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  434  	REQ_F_NEED_CLEANUP	= BIT(REQ_F_NEED_CLEANUP_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  435  	/* already went through poll handler */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  436  	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  437  	/* buffer already selected */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  438  	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  439  	/* buffer selected from ring, needs commit */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  440  	REQ_F_BUFFER_RING	= BIT(REQ_F_BUFFER_RING_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  441  	/* caller should reissue async */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  442  	REQ_F_REISSUE		= BIT(REQ_F_REISSUE_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  443  	/* supports async reads/writes */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  444  	REQ_F_SUPPORT_NOWAIT	= BIT(REQ_F_SUPPORT_NOWAIT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  445  	/* regular file */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  446  	REQ_F_ISREG		= BIT(REQ_F_ISREG_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  447  	/* has creds assigned */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  448  	REQ_F_CREDS		= BIT(REQ_F_CREDS_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  449  	/* skip refcounting if not set */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  450  	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  451  	/* there is a linked timeout that has to be armed */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  452  	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  453  	/* ->async_data allocated */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  454  	REQ_F_ASYNC_DATA	= BIT(REQ_F_ASYNC_DATA_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  455  	/* don't post CQEs while failing linked requests */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  456  	REQ_F_SKIP_LINK_CQES	= BIT(REQ_F_SKIP_LINK_CQES_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  457  	/* single poll may be active */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  458  	REQ_F_SINGLE_POLL	= BIT(REQ_F_SINGLE_POLL_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  459  	/* double poll may active */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  460  	REQ_F_DOUBLE_POLL	= BIT(REQ_F_DOUBLE_POLL_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  461  	/* request has already done partial IO */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  462  	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  463  	/* fast poll multishot mode */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  464  	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  465  	/* ->extra1 and ->extra2 are initialised */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  466  	REQ_F_CQE32_INIT	= BIT(REQ_F_CQE32_INIT_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  467  	/* recvmsg special flag, clear EPOLLIN */
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  468  	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
9ca9fb24d5febc io_uring/io_uring_types.h      Pavel Begunkov 2022-06-16  469  	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
9ca9fb24d5febc io_uring/io_uring_types.h      Pavel Begunkov 2022-06-16  470  	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
0d433a1cf7f666 include/linux/io_uring_types.h Xiaoguang Wang 2023-02-15  471  	/* buffer comes from kernel */
0d433a1cf7f666 include/linux/io_uring_types.h Xiaoguang Wang 2023-02-15 @472  	REQ_F_KBUF		= BIT(REQ_F_KBUF_BIT),
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  473  };
e27f928ee1cb06 io_uring/io_uring_types.h      Jens Axboe     2022-05-24  474  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-02-22 14:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-15  0:41 [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Xiaoguang Wang
2023-02-15  0:41 ` [RFC 1/3] bpf: add UBLK program type Xiaoguang Wang
2023-02-15 15:52   ` kernel test robot
2023-02-15  0:41 ` [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel Xiaoguang Wang
2023-02-15 14:17   ` kernel test robot
2023-02-15 14:50   ` kernel test robot
2023-02-15 15:41   ` kernel test robot
2023-02-15  0:41 ` [RFC 3/3] ublk_drv: add ebpf support Xiaoguang Wang
2023-02-16  8:11   ` Ming Lei
2023-02-16 12:12     ` Xiaoguang Wang
2023-02-17  3:02       ` Ming Lei
2023-02-17 10:46         ` Ming Lei
2023-02-22 14:13         ` Xiaoguang Wang
2023-02-15  0:46 ` [UBLKSRV] Add " Xiaoguang Wang
2023-02-16  8:28   ` Ming Lei
2023-02-16  9:17     ` Xiaoguang Wang
2023-02-15  8:40 ` [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk Ziyang Zhang
2023-02-15 18:57 [RFC 2/3] io_uring: enable io_uring to submit sqes located in kernel kernel test robot
