linux-block.vger.kernel.org archive mirror
* [RFC PATCH 0/2] block: use eBPF to redirect IO completion
@ 2019-10-14 12:28 Hou Tao
  2019-10-14 12:28 ` [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF Hou Tao
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Hou Tao @ 2019-10-14 12:28 UTC (permalink / raw)
  To: linux-block, bpf, netdev, axboe, ast
  Cc: hare, osandov, ming.lei, damien.lemoal, bvanassche, daniel,
	kafai, songliubraving, yhs

For the network stack, RPS (Receive Packet Steering) is used to
distribute network protocol processing from the hardware-interrupted CPU
to specific CPUs, alleviating the soft-irq load of the interrupted CPU.

For the block layer, a soft-irq (for single-queue devices) or a hard-irq
(for multi-queue devices) is used to handle IO completion, so an
RPS-like mechanism would be useful when the soft-irq or hard-irq load
of a specific CPU is too high, or when a specific CPU set is required
to handle IO completion.

Instead of setting the CPU set used for handling IO completion
through sysfs or procfs, we can attach an eBPF program to the
request queue, provide some useful info (e.g., the CPU
that submits the request) to the program, and let the program
decide the proper CPU for IO completion handling.

In order to demonstrate the effect of IO completion redirection,
a test program is built to redirect the IO completion handling
to all online CPUs or a specific CPU set:

	./test_blkdev_ccpu -d /dev/vda
or
	./test_blkdev_ccpu -d /dev/nvme0n1 -s 4,8,10-13

However, I am still trying to find a killer scenario for
the eBPF redirection, so suggestions and comments are welcome.

Regards,
Tao

Hou Tao (2):
  block: add support for redirecting IO completion through eBPF
  selftests/bpf: add test program for redirecting IO completion CPU

 block/Makefile                                |   2 +-
 block/blk-bpf.c                               | 127 +++++++++
 block/blk-mq.c                                |  22 +-
 block/blk-softirq.c                           |  27 +-
 include/linux/blkdev.h                        |   3 +
 include/linux/bpf_blkdev.h                    |   9 +
 include/linux/bpf_types.h                     |   1 +
 include/uapi/linux/bpf.h                      |   2 +
 kernel/bpf/syscall.c                          |   9 +
 tools/include/uapi/linux/bpf.h                |   2 +
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/bpf/Makefile          |   1 +
 .../selftests/bpf/progs/blkdev_ccpu_rr.c      |  66 +++++
 .../testing/selftests/bpf/test_blkdev_ccpu.c  | 246 ++++++++++++++++++
 15 files changed, 507 insertions(+), 12 deletions(-)
 create mode 100644 block/blk-bpf.c
 create mode 100644 include/linux/bpf_blkdev.h
 create mode 100644 tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c
 create mode 100644 tools/testing/selftests/bpf/test_blkdev_ccpu.c

-- 
2.22.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-14 12:28 [RFC PATCH 0/2] block: use eBPF to redirect IO completion Hou Tao
@ 2019-10-14 12:28 ` Hou Tao
  2019-10-15 21:04   ` Alexei Starovoitov
  2019-10-14 12:28 ` [RFC PATCH 2/2] selftests/bpf: add test program for redirecting IO completion CPU Hou Tao
  2019-10-15  1:20 ` [RFC PATCH 0/2] block: use eBPF to redirect IO completion Bob Liu
  2 siblings, 1 reply; 9+ messages in thread
From: Hou Tao @ 2019-10-14 12:28 UTC (permalink / raw)
  To: linux-block, bpf, netdev, axboe, ast
  Cc: hare, osandov, ming.lei, damien.lemoal, bvanassche, daniel,
	kafai, songliubraving, yhs

For the network stack, RPS (Receive Packet Steering) is used to
distribute network protocol processing from the hardware-interrupted CPU
to specific CPUs, alleviating the soft-irq load of the interrupted CPU.

For the block layer, a soft-irq (for single-queue devices) or a hard-irq
(for multi-queue devices) is used to handle IO completion, so an
RPS-like mechanism would be useful when the soft-irq or hard-irq load
of a specific CPU is too high, or when a specific CPU set is required
to handle IO completion.

Instead of setting the CPU set used for handling IO completion
through sysfs or procfs, we can attach an eBPF program to the
request queue, provide some useful info (e.g., the CPU
that submits the request) to the program, and let the program
decide the proper CPU for IO completion handling.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 block/Makefile             |   2 +-
 block/blk-bpf.c            | 127 +++++++++++++++++++++++++++++++++++++
 block/blk-mq.c             |  22 +++++--
 block/blk-softirq.c        |  27 ++++++--
 include/linux/blkdev.h     |   3 +
 include/linux/bpf_blkdev.h |   9 +++
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   2 +
 kernel/bpf/syscall.c       |   9 +++
 9 files changed, 190 insertions(+), 12 deletions(-)
 create mode 100644 block/blk-bpf.c
 create mode 100644 include/linux/bpf_blkdev.h

diff --git a/block/Makefile b/block/Makefile
index 9ef57ace90d4..0adb0f655e8c 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
 			genhd.o partition-generic.o ioprio.o \
-			badblocks.o partitions/ blk-rq-qos.o
+			badblocks.o partitions/ blk-rq-qos.o blk-bpf.o
 
 obj-$(CONFIG_BOUNCE)		+= bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST)	+= scsi_ioctl.o
diff --git a/block/blk-bpf.c b/block/blk-bpf.c
new file mode 100644
index 000000000000..d9e3b1caead4
--- /dev/null
+++ b/block/blk-bpf.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Hou Tao <houtao1@huawei.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/fs.h>
+#include <linux/bpf_blkdev.h>
+#include <linux/blkdev.h>
+
+extern const struct file_operations def_blk_fops;
+
+static DEFINE_SPINLOCK(blkdev_bpf_lock);
+
+const struct bpf_prog_ops blkdev_prog_ops = {
+};
+
+static const struct bpf_func_proto *
+blkdev_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_get_numa_node_id:
+		return &bpf_get_numa_node_id_proto;
+	default:
+		return NULL;
+	}
+}
+
+const struct bpf_verifier_ops blkdev_verifier_ops = {
+	.get_func_proto = blkdev_prog_func_proto,
+};
+
+static struct request_queue *blkdev_rq_by_file(struct file *filp)
+{
+	struct block_device *bdev;
+
+	if (filp->f_op != &def_blk_fops)
+		return ERR_PTR(-EINVAL);
+
+	bdev = I_BDEV(filp->f_mapping->host);
+
+	return bdev->bd_queue;
+}
+
+int blkdev_bpf_prog_attach(const union bpf_attr *attr,
+		enum bpf_prog_type ptype, struct bpf_prog *prog)
+{
+	int ret = 0;
+	struct file *filp;
+	struct request_queue *rq;
+
+	filp = fget(attr->target_fd);
+	if (!filp) {
+		ret = -EINVAL;
+		goto fget_err;
+	}
+
+	rq = blkdev_rq_by_file(filp);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		goto to_rq_err;
+	}
+
+	spin_lock(&blkdev_bpf_lock);
+	if (rq->prog) {
+		ret = -EBUSY;
+		goto set_prog_err;
+	}
+
+	rcu_assign_pointer(rq->prog, prog);
+
+set_prog_err:
+	spin_unlock(&blkdev_bpf_lock);
+to_rq_err:
+	fput(filp);
+fget_err:
+	return ret;
+}
+
+int blkdev_bpf_prog_detach(const union bpf_attr *attr)
+{
+	int ret = 0;
+	struct file *filp;
+	struct request_queue *rq;
+	struct bpf_prog *old_prog;
+
+	filp = fget(attr->target_fd);
+	if (!filp) {
+		ret = -EINVAL;
+		goto fget_err;
+	}
+
+	rq = blkdev_rq_by_file(filp);
+	if (IS_ERR(rq)) {
+		ret = PTR_ERR(rq);
+		goto to_rq_err;
+	}
+
+	old_prog = NULL;
+	spin_lock(&blkdev_bpf_lock);
+	if (!rq->prog) {
+		ret = -ENODATA;
+		goto clr_prog_err;
+	}
+	rcu_swap_protected(rq->prog, old_prog, 1);
+
+clr_prog_err:
+	spin_unlock(&blkdev_bpf_lock);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+to_rq_err:
+	fput(filp);
+fget_err:
+	return ret;
+}
+
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 20a49be536b5..5ac6fe6dbcd0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -26,6 +26,7 @@
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
 #include <linux/prefetch.h>
+#include <linux/filter.h>
 
 #include <trace/events/block.h>
 
@@ -584,6 +585,9 @@ static void __blk_mq_complete_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	bool shared = false;
 	int cpu;
+	int ccpu;
+	int bpf_ccpu = -1;
+	struct bpf_prog *prog;
 
 	WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
 	/*
@@ -610,15 +614,25 @@ static void __blk_mq_complete_request(struct request *rq)
 		return;
 	}
 
+	rcu_read_lock();
+	prog = rcu_dereference_protected(q->prog, 1);
+	if (prog)
+		bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
+	rcu_read_unlock();
+
 	cpu = get_cpu();
-	if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
-		shared = cpus_share_cache(cpu, ctx->cpu);
+	if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
+		ccpu = ctx->cpu;
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
+			shared = cpus_share_cache(cpu, ctx->cpu);
+	} else
+		ccpu = bpf_ccpu;
 
-	if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
+	if (cpu != ccpu && !shared && cpu_online(ccpu)) {
 		rq->csd.func = __blk_mq_complete_request_remote;
 		rq->csd.info = rq;
 		rq->csd.flags = 0;
-		smp_call_function_single_async(ctx->cpu, &rq->csd);
+		smp_call_function_single_async(ccpu, &rq->csd);
 	} else {
 		q->mq_ops->complete(rq);
 	}
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 457d9ba3eb20..1139a5352a59 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -11,6 +11,7 @@
 #include <linux/cpu.h>
 #include <linux/sched.h>
 #include <linux/sched/topology.h>
+#include <linux/filter.h>
 
 #include "blk.h"
 
@@ -101,20 +102,32 @@ void __blk_complete_request(struct request *req)
 	int cpu, ccpu = req->mq_ctx->cpu;
 	unsigned long flags;
 	bool shared = false;
+	int bpf_ccpu = -1;
+	struct bpf_prog *prog;
 
 	BUG_ON(!q->mq_ops->complete);
 
-	local_irq_save(flags);
-	cpu = smp_processor_id();
+	rcu_read_lock();
+	prog = rcu_dereference_protected(q->prog, 1);
+	if (prog)
+		bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
+	rcu_read_unlock();
 
 	/*
-	 * Select completion CPU
+	 * Select completion CPU.
+	 * If a valid CPU number is returned by eBPF program, use it directly.
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && ccpu != -1) {
-		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
-			shared = cpus_share_cache(cpu, ccpu);
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
+		if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) &&
+			ccpu != -1) {
+			if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
+				shared = cpus_share_cache(cpu, ccpu);
+		} else
+			ccpu = cpu;
 	} else
-		ccpu = cpu;
+		ccpu = bpf_ccpu;
 
 	/*
 	 * If current CPU and requested CPU share a cache, run the softirq on
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d9db32fb75ee..849589c3c51c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -397,6 +397,8 @@ static inline int blkdev_reset_zones_ioctl(struct block_device *bdev,
 
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+struct bpf_prog;
+
 struct request_queue {
 	struct request		*last_merge;
 	struct elevator_queue	*elevator;
@@ -590,6 +592,7 @@ struct request_queue {
 
 #define BLK_MAX_WRITE_HINTS	5
 	u64			write_hints[BLK_MAX_WRITE_HINTS];
+	struct bpf_prog __rcu *prog;
 };
 
 #define QUEUE_FLAG_STOPPED	0	/* queue is stopped */
diff --git a/include/linux/bpf_blkdev.h b/include/linux/bpf_blkdev.h
new file mode 100644
index 000000000000..0777428bc6e2
--- /dev/null
+++ b/include/linux/bpf_blkdev.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __BPF_BLKDEV_H__
+#define __BPF_BLKDEV_H__
+
+extern int blkdev_bpf_prog_attach(const union bpf_attr *attr,
+		enum bpf_prog_type ptype, struct bpf_prog *prog);
+extern int blkdev_bpf_prog_detach(const union bpf_attr *attr);
+
+#endif /* !__BPF_BLKDEV_H__ */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 36a9c2325176..008facd336e5 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -38,6 +38,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_BLKDEV, blkdev)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77c6be96d676..36aa35e29be2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -173,6 +173,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+	BPF_PROG_TYPE_BLKDEV,
 };
 
 enum bpf_attach_type {
@@ -199,6 +200,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP6_RECVMSG,
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
+	BPF_BLKDEV_IOC_CPU,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 82eabd4e38ad..9724c0809f21 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4,6 +4,7 @@
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <linux/bpf_lirc.h>
+#include <linux/bpf_blkdev.h>
 #include <linux/btf.h>
 #include <linux/syscalls.h>
 #include <linux/slab.h>
@@ -1942,6 +1943,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_SETSOCKOPT:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
 		break;
+	case BPF_BLKDEV_IOC_CPU:
+		ptype = BPF_PROG_TYPE_BLKDEV;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1966,6 +1970,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_BLKDEV:
+		ret = blkdev_bpf_prog_attach(attr, ptype, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -2029,6 +2036,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_SETSOCKOPT:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
 		break;
+	case BPF_BLKDEV_IOC_CPU:
+		return blkdev_bpf_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
-- 
2.22.0



* [RFC PATCH 2/2] selftests/bpf: add test program for redirecting IO completion CPU
  2019-10-14 12:28 [RFC PATCH 0/2] block: use eBPF to redirect IO completion Hou Tao
  2019-10-14 12:28 ` [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF Hou Tao
@ 2019-10-14 12:28 ` Hou Tao
  2019-10-15  1:20 ` [RFC PATCH 0/2] block: use eBPF to redirect IO completion Bob Liu
  2 siblings, 0 replies; 9+ messages in thread
From: Hou Tao @ 2019-10-14 12:28 UTC (permalink / raw)
  To: linux-block, bpf, netdev, axboe, ast
  Cc: hare, osandov, ming.lei, damien.lemoal, bvanassche, daniel,
	kafai, songliubraving, yhs

A simple round-robin strategy is implemented to redirect the IO
completion handling to all online CPUs or a specific CPU set cyclically.

Use the following command to distribute the IO completion of vda
to all online CPUs:

	./test_blkdev_ccpu -d /dev/vda

And use the following command to distribute the IO completion of nvme0n1
to a specific CPU set:

	./test_blkdev_ccpu -d /dev/nvme0n1 -s 4,8,10-13

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/include/uapi/linux/bpf.h                |   2 +
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/bpf/Makefile          |   1 +
 .../selftests/bpf/progs/blkdev_ccpu_rr.c      |  66 +++++
 .../testing/selftests/bpf/test_blkdev_ccpu.c  | 246 ++++++++++++++++++
 6 files changed, 317 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c
 create mode 100644 tools/testing/selftests/bpf/test_blkdev_ccpu.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 77c6be96d676..36aa35e29be2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -173,6 +173,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
 	BPF_PROG_TYPE_CGROUP_SOCKOPT,
+	BPF_PROG_TYPE_BLKDEV,
 };
 
 enum bpf_attach_type {
@@ -199,6 +200,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP6_RECVMSG,
 	BPF_CGROUP_GETSOCKOPT,
 	BPF_CGROUP_SETSOCKOPT,
+	BPF_BLKDEV_IOC_CPU,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e0276520171b..5a849d6d30be 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -3579,6 +3579,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_PERF_EVENT:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_BLKDEV:
 		return false;
 	case BPF_PROG_TYPE_KPROBE:
 	default:
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 4b0b0364f5fc..311e13e778a3 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -102,6 +102,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
 	case BPF_PROG_TYPE_CGROUP_SOCKOPT:
+	case BPF_PROG_TYPE_BLKDEV:
 	default:
 		break;
 	}
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 6889c19a628c..6a36234adfea 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -30,6 +30,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_cgroup_storage test_select_reuseport test_section_names \
 	test_netcnt test_tcpnotify_user test_sock_fields test_sysctl test_hashmap \
 	test_btf_dump test_cgroup_attach xdping
+TEST_GEN_PROGS += test_blkdev_ccpu
 
 BPF_OBJ_FILES = $(patsubst %.c,%.o, $(notdir $(wildcard progs/*.c)))
 TEST_GEN_FILES = $(BPF_OBJ_FILES)
diff --git a/tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c b/tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c
new file mode 100644
index 000000000000..6f66d51fe6af
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Hou Tao <houtao1@huawei.com>
+ */
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+/* Index to CPU set */
+struct bpf_map_def SEC("maps") idx_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+BPF_ANNOTATE_KV_PAIR(idx_map, __u32, __u32);
+
+/* Size of CPU set */
+struct bpf_map_def SEC("maps") cnt_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+BPF_ANNOTATE_KV_PAIR(cnt_map, __u32, __u32);
+
+/* CPU set */
+struct bpf_map_def SEC("maps") cpu_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 256,
+};
+BPF_ANNOTATE_KV_PAIR(cpu_map, __u32, __u32);
+
+SEC("ccpu_demo")
+int customized_round_robin_ccpu(void *ctx)
+{
+	__u32 key = 0;
+	__u32 *idx_ptr;
+	__u32 *cnt_ptr;
+	__u32 *cpu_ptr;
+	__u32 idx;
+	__u32 cnt;
+
+	idx_ptr = bpf_map_lookup_elem(&idx_map, &key);
+	if (!idx_ptr)
+		return -1;
+	idx = (*idx_ptr)++;
+
+	cnt_ptr = bpf_map_lookup_elem(&cnt_map, &key);
+	if (!cnt_ptr)
+		return -1;
+	cnt = *cnt_ptr;
+	if (!cnt)
+		return -1;
+
+	idx %= cnt;
+	cpu_ptr = bpf_map_lookup_elem(&cpu_map, &idx);
+	if (!cpu_ptr)
+		return -1;
+
+	return *cpu_ptr;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_blkdev_ccpu.c b/tools/testing/selftests/bpf/test_blkdev_ccpu.c
new file mode 100644
index 000000000000..ec5981e7e2ed
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_blkdev_ccpu.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Hou Tao <houtao1@huawei.com>
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "bpf_util.h"
+#include "bpf_rlimit.h"
+
+static int
+print_all_levels(enum libbpf_print_level level,
+		 const char *format, va_list args)
+{
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_handler(int num)
+{
+}
+
+static int parse_cpu_set(const char *str, const unsigned int **cpus,
+	int *cpu_nr)
+{
+	int total;
+	unsigned int *set;
+	int err;
+	int idx;
+	const char *from;
+
+	total = libbpf_num_possible_cpus();
+	if (total <= 0)
+		return -1;
+
+	set = calloc(total, sizeof(*set));
+	if (!set) {
+		printf("Failed to alloc cpuset (cpu nr: %d)\n", total);
+		return -1;
+	}
+
+	if (!str) {
+		for (idx = 0; idx < total; idx++)
+			set[idx] = idx;
+		*cpus = set;
+		*cpu_nr = total;
+
+		return 0;
+	}
+
+	err = 0;
+	idx = 0;
+	from = str;
+	while (1) {
+		char *endptr;
+		int start;
+		int end;
+
+		start = strtol(from, &endptr, 10);
+		if (*endptr != '-' && *endptr != ',' &&
+			(*endptr != '\0' || endptr == from)) {
+			err = -1;
+			break;
+		}
+		if (*endptr == '\0' || *endptr == ',') {
+			printf("add cpu %d\n", start);
+			set[idx++] = start;
+			if (*endptr == '\0')
+				break;
+		}
+		from = endptr + 1;
+		if (*endptr == ',')
+			continue;
+
+		end = strtol(from, &endptr, 10);
+		if (*endptr != ',' && (*endptr != '\0' || endptr == from)) {
+			err = -1;
+			break;
+		}
+		for (; start <= end; start++) {
+			printf("add cpu %d\n", start);
+			set[idx++] = start;
+		}
+		if (*endptr == '\0')
+			break;
+		from = endptr + 1;
+	}
+
+	if (err) {
+		printf("invalid cpu set spec '%s'\n", from);
+		free(set);
+		return -1;
+	}
+
+	*cpus = set;
+	*cpu_nr = idx;
+
+	return 0;
+}
+
+static int load_cpu_set(struct bpf_object *obj, const unsigned int *cpus,
+	int cnt)
+{
+	const char *name;
+	struct bpf_map *map;
+	int fd;
+	int idx;
+
+	name = "cpu_map";
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map) {
+		printf("no map %s\n", name);
+		return -1;
+	}
+
+	fd = bpf_map__fd(map);
+	if (fd < 0) {
+		printf("invalid fd for map %s\n", name);
+		return -1;
+	}
+
+	for (idx = 0; idx < cnt; idx++) {
+		if (bpf_map_update_elem(fd, &idx, &cpus[idx], 0)) {
+			printf("%s[%u] = %u error %s\n",
+					name, idx, cpus[idx], strerror(errno));
+			return -1;
+		}
+		printf("%s[%u] = %u\n", name, idx, cpus[idx]);
+	}
+
+	name = "cnt_map";
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map) {
+		printf("no map %s\n", name);
+		return -1;
+	}
+
+	fd = bpf_map__fd(map);
+	if (fd < 0) {
+		printf("invalid fd for map %s\n", name);
+		return -1;
+	}
+
+	idx = 0;
+	if (bpf_map_update_elem(fd, &idx, &cnt, 0)) {
+		printf("%s[%u] = %u error %s\n",
+				name, idx, cnt, strerror(errno));
+		return -1;
+	}
+	printf("%s[%u] = %u\n", name, idx, cnt);
+
+	return 0;
+}
+
+static void usage(const char *cmd)
+{
+	printf("Usage: %s -d blk_device [-s cpu_set]\n"
+			"  round-robin all CPUs: %s -d /dev/sda\n"
+			"  round-robin specific CPUs: %s -d /dev/sda -s 4-7,12-15\n",
+			cmd, cmd, cmd);
+	exit(1);
+}
+
+int main(int argc, char **argv)
+{
+	int opt;
+	const char *prog = "./blkdev_ccpu_rr.o";
+	const char *bdev;
+	const char *cpu_set_str = NULL;
+	const unsigned int *cpus;
+	int cpu_nr;
+	struct bpf_object *obj;
+	int prog_fd;
+	int bdev_fd;
+
+	while ((opt = getopt(argc, argv, "d:s:h")) != -1) {
+		switch (opt) {
+		case 'd':
+			bdev = optarg;
+			break;
+		case 's':
+			cpu_set_str = optarg;
+			break;
+		case 'h':
+			usage(argv[0]);
+			break;
+		}
+	}
+
+	if (!bdev)
+		usage(argv[0]);
+
+	printf("blk device %s, cpu set %s\n", bdev, cpu_set_str);
+
+	signal(SIGINT, sig_handler);
+	signal(SIGQUIT, sig_handler);
+
+	libbpf_set_print(print_all_levels);
+
+	if (parse_cpu_set(cpu_set_str, &cpus, &cpu_nr))
+		goto out;
+
+	if (bpf_prog_load(prog, BPF_PROG_TYPE_BLKDEV, &obj, &prog_fd)) {
+		printf("Failed to load %s\n", prog);
+		goto out;
+	}
+
+	if (load_cpu_set(obj, cpus, cpu_nr))
+		goto out;
+
+	bdev_fd = open(bdev, O_RDWR);
+	if (bdev_fd < 0) {
+		printf("Failed to open %s %s\n", bdev, strerror(errno));
+		goto out;
+	}
+
+	/* Attach bpf program */
+	if (bpf_prog_attach(prog_fd, bdev_fd, BPF_BLKDEV_IOC_CPU, 0)) {
+		printf("Failed to attach %s %s\n", prog, strerror(errno));
+		goto out;
+	}
+
+	printf("Attached, use Ctrl-C to detach\n\n");
+
+	pause();
+
+	if (bpf_prog_detach(bdev_fd, BPF_BLKDEV_IOC_CPU)) {
+		printf("Failed to detach %s %s\n", prog, strerror(errno));
+		goto out;
+	}
+
+	return 0;
+out:
+	return 1;
+}
-- 
2.22.0



* Re: [RFC PATCH 0/2] block: use eBPF to redirect IO completion
  2019-10-14 12:28 [RFC PATCH 0/2] block: use eBPF to redirect IO completion Hou Tao
  2019-10-14 12:28 ` [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF Hou Tao
  2019-10-14 12:28 ` [RFC PATCH 2/2] selftests/bpf: add test program for redirecting IO completion CPU Hou Tao
@ 2019-10-15  1:20 ` Bob Liu
  2 siblings, 0 replies; 9+ messages in thread
From: Bob Liu @ 2019-10-15  1:20 UTC (permalink / raw)
  To: Hou Tao, linux-block, bpf, netdev, axboe, ast
  Cc: hare, osandov, ming.lei, damien.lemoal, bvanassche, daniel,
	kafai, songliubraving, yhs

On 10/14/19 8:28 PM, Hou Tao wrote:
> For network stack, RPS, namely Receive Packet Steering, is used to
> distribute network protocol processing from hardware-interrupted CPU
> to specific CPUs and alleviating soft-irq load of the interrupted CPU.
> 
> For block layer, soft-irq (for single queue device) or hard-irq
> (for multiple queue device) is used to handle IO completion, so
> RPS will be useful when the soft-irq load or the hard-irq load
> of a specific CPU is too high, or a specific CPU set is required
> to handle IO completion.
> 
> Instead of setting the CPU set used for handling IO completion
> through sysfs or procfs, we can attach an eBPF program to the
> request-queue, provide some useful info (e.g., the CPU
> which submits the request) to the program, and let the program
> decides the proper CPU for IO completion handling.
> 

But it looks like there isn't any benefit over sysfs/procfs?

> In order to demonostrate the effect of IO completion redirection,
> a test programm is built to redirect the IO completion handling
> to all online CPUs or a specific CPU set:
> 
> 	./test_blkdev_ccpu -d /dev/vda
> or
> 	./test_blkdev_ccpu -d /dev/nvme0n1 -s 4,8,10-13
> 
> However I am still trying to find out a killer scenario for

Speaking of scenarios, perhaps attaching a filter could be useful,
so that the data can be processed in the first place?

-
Bob

> the eBPF redirection, so suggestions and comments are welcome.
> 
> Regards,
> Tao
> 
> Hou Tao (2):
>   block: add support for redirecting IO completion through eBPF
>   selftests/bpf: add test program for redirecting IO completion CPU
> 
>  block/Makefile                                |   2 +-
>  block/blk-bpf.c                               | 127 +++++++++
>  block/blk-mq.c                                |  22 +-
>  block/blk-softirq.c                           |  27 +-
>  include/linux/blkdev.h                        |   3 +
>  include/linux/bpf_blkdev.h                    |   9 +
>  include/linux/bpf_types.h                     |   1 +
>  include/uapi/linux/bpf.h                      |   2 +
>  kernel/bpf/syscall.c                          |   9 +
>  tools/include/uapi/linux/bpf.h                |   2 +
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf_probes.c                 |   1 +
>  tools/testing/selftests/bpf/Makefile          |   1 +
>  .../selftests/bpf/progs/blkdev_ccpu_rr.c      |  66 +++++
>  .../testing/selftests/bpf/test_blkdev_ccpu.c  | 246 ++++++++++++++++++
>  15 files changed, 507 insertions(+), 12 deletions(-)
>  create mode 100644 block/blk-bpf.c
>  create mode 100644 include/linux/bpf_blkdev.h
>  create mode 100644 tools/testing/selftests/bpf/progs/blkdev_ccpu_rr.c
>  create mode 100644 tools/testing/selftests/bpf/test_blkdev_ccpu.c
> 



* Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-14 12:28 ` [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF Hou Tao
@ 2019-10-15 21:04   ` Alexei Starovoitov
  2019-10-16  7:05     ` Hannes Reinecke
  2019-10-21 13:42     ` Hou Tao
  0 siblings, 2 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2019-10-15 21:04 UTC (permalink / raw)
  To: Hou Tao
  Cc: linux-block, bpf, Network Development, Jens Axboe,
	Alexei Starovoitov, hare, osandov, ming.lei, damien.lemoal,
	bvanassche, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song

On Mon, Oct 14, 2019 at 5:21 AM Hou Tao <houtao1@huawei.com> wrote:
>
> For network stack, RPS, namely Receive Packet Steering, is used to
> distribute network protocol processing from hardware-interrupted CPU
> to specific CPUs and alleviating soft-irq load of the interrupted CPU.
>
> For block layer, soft-irq (for single queue device) or hard-irq
> (for multiple queue device) is used to handle IO completion, so
> RPS will be useful when the soft-irq load or the hard-irq load
> of a specific CPU is too high, or a specific CPU set is required
> to handle IO completion.
>
> Instead of setting the CPU set used for handling IO completion
> through sysfs or procfs, we can attach an eBPF program to the
> request-queue, provide some useful info (e.g., the CPU
> which submits the request) to the program, and let the program
> decides the proper CPU for IO completion handling.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
...
>
> +       rcu_read_lock();
> +       prog = rcu_dereference_protected(q->prog, 1);
> +       if (prog)
> +               bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
> +       rcu_read_unlock();
> +
>         cpu = get_cpu();
> -       if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> -               shared = cpus_share_cache(cpu, ctx->cpu);
> +       if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
> +               ccpu = ctx->cpu;
> +               if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> +                       shared = cpus_share_cache(cpu, ctx->cpu);
> +       } else
> +               ccpu = bpf_ccpu;
>
> -       if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
> +       if (cpu != ccpu && !shared && cpu_online(ccpu)) {
>                 rq->csd.func = __blk_mq_complete_request_remote;
>                 rq->csd.info = rq;
>                 rq->csd.flags = 0;
> -               smp_call_function_single_async(ctx->cpu, &rq->csd);
> +               smp_call_function_single_async(ccpu, &rq->csd);

Interesting idea.
Not sure whether such programability makes sense from
block layer point of view.

From bpf side having a program with NULL input context is
a bit odd. We never had such things in the past, so this patchset
won't work as-is.
Also no-input means that the program choices are quite limited.
Other than round robin and random I cannot come up with other
cpu selection ideas.
I suggest using a writable tracepoint here instead.
Take a look at trace_nbd_send_request.
BPF prog can write into 'request'.
For your use case it will be able to write into 'bpf_ccpu' local variable.
If you keep it as raw tracepoint and don't add the actual tracepoint
with TP_STRUCT__entry and TP_fast_assign then it won't be abi
and you can change it later or remove it altogether.
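For reference, a rough sketch of what that shape could look like in the completion path (names are hypothetical, the BPF-side writable plumbing done for trace_nbd_send_request is elided, and this is not code from any posted patch):

```c
/* Bare DECLARE_TRACE with no TP_STRUCT__entry/TP_fast_assign, so only
 * a raw tracepoint exists and no stable trace-event ABI is created.
 * 'ccpu' is passed by reference so an attached writable BPF program
 * can overwrite it. */
DECLARE_TRACE(block_mq_complete_cpu,
	TP_PROTO(struct request *rq, int *ccpu),
	TP_ARGS(rq, ccpu));

/* At the completion site, roughly: */
	int ccpu = ctx->cpu;

	trace_block_mq_complete_cpu(rq, &ccpu);
	if (ccpu < 0 || !cpu_online(ccpu))
		ccpu = ctx->cpu;	/* fall back to the submitting CPU */
```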

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-15 21:04   ` Alexei Starovoitov
@ 2019-10-16  7:05     ` Hannes Reinecke
  2019-10-21 13:42     ` Hou Tao
  1 sibling, 0 replies; 9+ messages in thread
From: Hannes Reinecke @ 2019-10-16  7:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Hou Tao
  Cc: linux-block, bpf, Network Development, Jens Axboe,
	Alexei Starovoitov, hare, osandov, ming.lei, damien.lemoal,
	bvanassche, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song

On 10/15/19 11:04 PM, Alexei Starovoitov wrote:
> On Mon, Oct 14, 2019 at 5:21 AM Hou Tao <houtao1@huawei.com> wrote:
>>
>> For network stack, RPS, namely Receive Packet Steering, is used to
>> distribute network protocol processing from hardware-interrupted CPU
>> to specific CPUs and to alleviate the soft-irq load of the interrupted CPU.
>>
>> For block layer, soft-irq (for single queue device) or hard-irq
>> (for multiple queue device) is used to handle IO completion, so
>> RPS will be useful when the soft-irq load or the hard-irq load
>> of a specific CPU is too high, or a specific CPU set is required
>> to handle IO completion.
>>
>> Instead of setting the CPU set used for handling IO completion
>> through sysfs or procfs, we can attach an eBPF program to the
>> request-queue, provide some useful info (e.g., the CPU
>> which submits the request) to the program, and let the program
>> decide the proper CPU for IO completion handling.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ...
>>
>> +       rcu_read_lock();
>> +       prog = rcu_dereference_protected(q->prog, 1);
>> +       if (prog)
>> +               bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
>> +       rcu_read_unlock();
>> +
>>         cpu = get_cpu();
>> -       if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>> -               shared = cpus_share_cache(cpu, ctx->cpu);
>> +       if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
>> +               ccpu = ctx->cpu;
>> +               if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>> +                       shared = cpus_share_cache(cpu, ctx->cpu);
>> +       } else
>> +               ccpu = bpf_ccpu;
>>
>> -       if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
>> +       if (cpu != ccpu && !shared && cpu_online(ccpu)) {
>>                 rq->csd.func = __blk_mq_complete_request_remote;
>>                 rq->csd.info = rq;
>>                 rq->csd.flags = 0;
>> -               smp_call_function_single_async(ctx->cpu, &rq->csd);
>> +               smp_call_function_single_async(ccpu, &rq->csd);
> 
> Interesting idea.
> Not sure whether such programmability makes sense from
> the block layer's point of view.
> 
> From bpf side having a program with NULL input context is
> a bit odd. We never had such things in the past, so this patchset
> won't work as-is.
> Also no-input means that the program choices are quite limited.
> Other than round robin and random I cannot come up with other
> cpu selection ideas.
> I suggest doing a writable tracepoint here instead.
> Take a look at trace_nbd_send_request.
> BPF prog can write into 'request'.
> For your use case it will be able to write into 'bpf_ccpu' local variable.
> If you keep it as raw tracepoint and don't add the actual tracepoint
> with TP_STRUCT__entry and TP_fast_assign then it won't be abi
> and you can change it later or remove it altogether.
> 
That basically was my idea, too.

Actually I was coming from a different angle, namely trying to figure
out how we could do generic error injection in the block layer.
eBPF would be one way of doing it, kprobes another.

But writable trace events ... I'll have to check if we can leverage that
here, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-15 21:04   ` Alexei Starovoitov
  2019-10-16  7:05     ` Hannes Reinecke
@ 2019-10-21 13:42     ` Hou Tao
  2019-10-21 13:48       ` Bart Van Assche
  1 sibling, 1 reply; 9+ messages in thread
From: Hou Tao @ 2019-10-21 13:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-block, bpf, Network Development, Jens Axboe,
	Alexei Starovoitov, hare, osandov, ming.lei, damien.lemoal,
	bvanassche, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song

Hi,

On 2019/10/16 5:04, Alexei Starovoitov wrote:
> On Mon, Oct 14, 2019 at 5:21 AM Hou Tao <houtao1@huawei.com> wrote:
>>
>> For network stack, RPS, namely Receive Packet Steering, is used to
>> distribute network protocol processing from hardware-interrupted CPU
>> to specific CPUs and to alleviate the soft-irq load of the interrupted CPU.
>>
>> For block layer, soft-irq (for single queue device) or hard-irq
>> (for multiple queue device) is used to handle IO completion, so
>> RPS will be useful when the soft-irq load or the hard-irq load
>> of a specific CPU is too high, or a specific CPU set is required
>> to handle IO completion.
>>
>> Instead of setting the CPU set used for handling IO completion
>> through sysfs or procfs, we can attach an eBPF program to the
>> request-queue, provide some useful info (e.g., the CPU
>> which submits the request) to the program, and let the program
>> decide the proper CPU for IO completion handling.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ...
>>
>> +       rcu_read_lock();
>> +       prog = rcu_dereference_protected(q->prog, 1);
>> +       if (prog)
>> +               bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
>> +       rcu_read_unlock();
>> +
>>         cpu = get_cpu();
>> -       if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>> -               shared = cpus_share_cache(cpu, ctx->cpu);
>> +       if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
>> +               ccpu = ctx->cpu;
>> +               if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
>> +                       shared = cpus_share_cache(cpu, ctx->cpu);
>> +       } else
>> +               ccpu = bpf_ccpu;
>>
>> -       if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
>> +       if (cpu != ccpu && !shared && cpu_online(ccpu)) {
>>                 rq->csd.func = __blk_mq_complete_request_remote;
>>                 rq->csd.info = rq;
>>                 rq->csd.flags = 0;
>> -               smp_call_function_single_async(ctx->cpu, &rq->csd);
>> +               smp_call_function_single_async(ccpu, &rq->csd);
> 
> Interesting idea.
> Not sure whether such programmability makes sense from
> the block layer's point of view.
> 
> From bpf side having a program with NULL input context is
> a bit odd. We never had such things in the past, so this patchset
> won't work as-is.
No, it just works.

> Also no-input means that the program choices are quite limited.
> Other than round robin and random I cannot come up with other
> cpu selection ideas.
> I suggest doing a writable tracepoint here instead.
> Take a look at trace_nbd_send_request.
> BPF prog can write into 'request'.
> For your use case it will be able to write into 'bpf_ccpu' local variable.
> If you keep it as raw tracepoint and don't add the actual tracepoint
> with TP_STRUCT__entry and TP_fast_assign then it won't be abi
> and you can change it later or remove it altogether.
> 
Your suggestion is much simpler: there will be no need to add a new
program type, and all that needs to be done is adding a raw tracepoint,
moving bpf_ccpu into struct request, and letting a BPF program modify it.

I will give it a try, and thanks for your suggestions.

Regards,
Tao



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-21 13:42     ` Hou Tao
@ 2019-10-21 13:48       ` Bart Van Assche
  2019-10-21 14:45         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2019-10-21 13:48 UTC (permalink / raw)
  To: Hou Tao, Alexei Starovoitov
  Cc: linux-block, bpf, Network Development, Jens Axboe,
	Alexei Starovoitov, hare, osandov, ming.lei, damien.lemoal,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song

On 10/21/19 6:42 AM, Hou Tao wrote:
> Your suggestion is much simpler, so there will be no need for adding a new
> program type, and all things need to be done are adding a raw tracepoint,
> moving bpf_ccpu into struct request, and letting a BPF program to modify it.

blk-mq already supports processing completions on the CPU that submitted
a request, so it's not clear to me why any changes in the block layer are
being proposed for redirecting I/O completions?
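For context, the existing mechanism Bart mentions is exposed through the per-queue rq_affinity sysfs attribute (0 leaves completions on the interrupted CPU, 1 steers them to a CPU sharing a cache with the submitter, 2 forces the exact submitting CPU, i.e. QUEUE_FLAG_SAME_FORCE). An illustrative userspace helper for setting it; the device path below is only an example:

```c
#include <stdio.h>

/* Write an integer to a sysfs-style attribute file such as
 * /sys/block/nvme0n1/queue/rq_affinity; returns 0 on success. */
static int write_queue_attr(const char *path, int val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", val) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

e.g. `write_queue_attr("/sys/block/nvme0n1/queue/rq_affinity", 2)` forces completions back onto the submitting CPU without any BPF involvement.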

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
  2019-10-21 13:48       ` Bart Van Assche
@ 2019-10-21 14:45         ` Jens Axboe
  0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2019-10-21 14:45 UTC (permalink / raw)
  To: Bart Van Assche, Hou Tao, Alexei Starovoitov
  Cc: linux-block, bpf, Network Development, Alexei Starovoitov, hare,
	osandov, ming.lei, damien.lemoal, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song

On 10/21/19 7:48 AM, Bart Van Assche wrote:
> On 10/21/19 6:42 AM, Hou Tao wrote:
>> Your suggestion is much simpler, so there will be no need for adding a new
>> program type, and all things need to be done are adding a raw tracepoint,
>> moving bpf_ccpu into struct request, and letting a BPF program to modify it.
> 
> blk-mq already supports processing completions on the CPU that submitted
> a request, so it's not clear to me why any changes in the block layer are
> being proposed for redirecting I/O completions?

That's where I'm getting confused as well. I'm not against adding BPF
functionality to the block layer, but this one seems a bit contrived.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-10-21 14:45 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-14 12:28 [RFC PATCH 0/2] block: use eBPF to redirect IO completion Hou Tao
2019-10-14 12:28 ` [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF Hou Tao
2019-10-15 21:04   ` Alexei Starovoitov
2019-10-16  7:05     ` Hannes Reinecke
2019-10-21 13:42     ` Hou Tao
2019-10-21 13:48       ` Bart Van Assche
2019-10-21 14:45         ` Jens Axboe
2019-10-14 12:28 ` [RFC PATCH 2/2] selftests/bpf: add test program for redirecting IO completion CPU Hou Tao
2019-10-15  1:20 ` [RFC PATCH 0/2] block: use eBPF to redirect IO completion Bob Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).