Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Hou Tao <houtao1@huawei.com>
Cc: linux-block@vger.kernel.org, bpf <bpf@vger.kernel.org>,
	Network Development <netdev@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Alexei Starovoitov <ast@kernel.org>,
	hare@suse.com, osandov@fb.com, ming.lei@redhat.com,
	damien.lemoal@wdc.com, bvanassche <bvanassche@acm.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>
Subject: Re: [RFC PATCH 1/2] block: add support for redirecting IO completion through eBPF
Date: Tue, 15 Oct 2019 14:04:40 -0700
Message-ID: <CAADnVQ+UJK41VL-epYGxrRzqL_UsC+X=J8EXEn2i8P+TPGA_jg@mail.gmail.com>
In-Reply-To: <20191014122833.64908-2-houtao1@huawei.com>

On Mon, Oct 14, 2019 at 5:21 AM Hou Tao <houtao1@huawei.com> wrote:
>
> For network stack, RPS, namely Receive Packet Steering, is used to
> distribute network protocol processing from hardware-interrupted CPU
> to specific CPUs and alleviating soft-irq load of the interrupted CPU.
>
> For block layer, soft-irq (for single queue device) or hard-irq
> (for multiple queue device) is used to handle IO completion, so
> RPS will be useful when the soft-irq load or the hard-irq load
> of a specific CPU is too high, or a specific CPU set is required
> to handle IO completion.
>
> Instead of setting the CPU set used for handling IO completion
> through sysfs or procfs, we can attach an eBPF program to the
> request-queue, provide some useful info (e.g., the CPU
> which submits the request) to the program, and let the program
> decides the proper CPU for IO completion handling.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
...
>
> +       rcu_read_lock();
> +       prog = rcu_dereference_protected(q->prog, 1);
> +       if (prog)
> +               bpf_ccpu = BPF_PROG_RUN(q->prog, NULL);
> +       rcu_read_unlock();
> +
>         cpu = get_cpu();
> -       if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> -               shared = cpus_share_cache(cpu, ctx->cpu);
> +       if (bpf_ccpu < 0 || !cpu_online(bpf_ccpu)) {
> +               ccpu = ctx->cpu;
> +               if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
> +                       shared = cpus_share_cache(cpu, ctx->cpu);
> +       } else
> +               ccpu = bpf_ccpu;
>
> -       if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
> +       if (cpu != ccpu && !shared && cpu_online(ccpu)) {
>                 rq->csd.func = __blk_mq_complete_request_remote;
>                 rq->csd.info = rq;
>                 rq->csd.flags = 0;
> -               smp_call_function_single_async(ctx->cpu, &rq->csd);
> +               smp_call_function_single_async(ccpu, &rq->csd);

Interesting idea.
Not sure whether such programmability makes sense from
the block layer point of view.

From the bpf side, having a program with a NULL input context is
a bit odd. We have never had such a thing, so this patchset
won't work as-is.
Also, no input means that the program's choices are quite limited.
Other than round robin and random, I cannot come up with other
cpu selection ideas.
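
To make the limitation concrete, here is a minimal userspace sketch
(not kernel code) of what a program without input can do. The names
rr_next and pick_cpu_round_robin are hypothetical and only stand in
for a prog run with a NULL context: with nothing to inspect, the only
state it can use is its own counter.

```c
#include <assert.h>

/* Hypothetical private state of a program that receives no context. */
static unsigned int rr_next;

/*
 * Pick a completion CPU with no knowledge of the request: nothing
 * about the submitting CPU or the request is visible, so the choice
 * can only cycle through the CPUs (or be random).
 */
static int pick_cpu_round_robin(unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	return (int)(rr_next++ % nr_cpus);
}
```

Successive calls with nr_cpus = 4 walk through 0, 1, 2, 3 and wrap
around, no matter which CPU submitted the request.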
I suggest using a writable tracepoint here instead.
Take a look at trace_nbd_send_request.
There the BPF prog can write into 'request'.
For your use case it would be able to write into the 'bpf_ccpu'
local variable.
If you keep it as a raw tracepoint and don't add the actual tracepoint
with TP_STRUCT__entry and TP_fast_assign, then it won't be ABI
and you can change it later or remove it altogether.
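
As a rough illustration of the writable-context idea, here is a
userspace model in plain C. It is not the tracepoint API: struct
complete_ctx, complete_hook_t, hook_pair_cpu and choose_ccpu are all
hypothetical names. The hook sees the submitting CPU and writes its
choice into bpf_ccpu; the caller keeps the same fallback as the quoted
hunk when the choice is invalid (a range check stands in for
cpu_online() here).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context handed to the hook; field names are illustrative. */
struct complete_ctx {
	int submit_cpu;	/* input: CPU that submitted the request */
	int bpf_ccpu;	/* output: CPU chosen by the hook, -1 = no choice */
};

typedef void (*complete_hook_t)(struct complete_ctx *ctx);

/* Example policy: complete on the neighbouring CPU (0<->1, 2<->3, ...). */
static void hook_pair_cpu(struct complete_ctx *ctx)
{
	ctx->bpf_ccpu = ctx->submit_cpu ^ 1;
}

/* Run the hook if one is attached, else fall back to the submitting CPU. */
static int choose_ccpu(int ctx_cpu, int nr_cpus, complete_hook_t hook)
{
	struct complete_ctx c = { .submit_cpu = ctx_cpu, .bpf_ccpu = -1 };

	assert(nr_cpus > 0);
	if (hook)
		hook(&c);
	/* Ignore a missing or out-of-range choice, as the patch does. */
	if (c.bpf_ccpu < 0 || c.bpf_ccpu >= nr_cpus)
		return ctx_cpu;
	return c.bpf_ccpu;
}
```

With the input available, policies beyond round robin become possible,
and an invalid choice degrades to the existing behaviour.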