From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>,
	Martin KaFai Lau <kafai@fb.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	John Fastabend <john.fastabend@gmail.com>
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, Cong Wang <cong.wang@bytedance.com>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@resnulli.us>
Subject: Re: [RFC Patch net-next] net_sched: introduce eBPF based Qdisc
Date: Fri, 03 Sep 2021 00:27:52 +0200	[thread overview]
Message-ID: <87bl5asjdj.fsf@toke.dk> (raw)
In-Reply-To: <613136d0cf411_2c56f2086@john-XPS-13-9370.notmuch>

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Martin KaFai Lau <kafai@fb.com> writes:
>> 
>> > On Wed, Sep 01, 2021 at 12:42:03PM +0200, Toke Høiland-Jørgensen wrote:
>> >> John Fastabend <john.fastabend@gmail.com> writes:
>> >> 
>> >> > Cong Wang wrote:
>> >> >> On Tue, Aug 24, 2021 at 4:47 PM Martin KaFai Lau <kafai@fb.com> wrote:
>> >> >> > Please explain more on this.  What is currently missing
>> >> >> > to make qdisc in struct_ops possible?
>> >> >> 
>> >> >> I think you misunderstand this point. The reason why I avoid it is
>> >> >> _not_ that anything is missing; quite the opposite, it is because
>> >> >> implementing a Qdisc with the struct_ops approach requires a lot of
>> >> >> work: literally all of struct Qdisc_ops (not to mention struct
>> >> >> Qdisc_class_ops). With the current approach, programmers only need
>> >> >> to implement two eBPF programs (enqueue and dequeue).
>> > _if_ it is used as a qdisc object/interface,
>> > the patch "looks" easier because it obscures some of the ops/interface
>> > from the bpf user.  The user will eventually ask for more flexibility
>> > and then for an interface on par with the kernel's qdisc.  If there are
>> > some common 'ops', the common bpf code can be shared as a library in
>> > userspace, or there is also the kfunc call to call into the kernel
>> > implementation.  For existing kernel qdisc authors, it will also be
>> > easier to use the same interface.
>> 
>> The question is whether it's useful to provide the full struct_ops for
>> qdiscs. Having it would allow a BPF program to implement that interface
>> towards userspace (things like statistics, classes, etc.), but will
>> anyone bother with that given the wealth of BPF-specific introspection
>> tools already available?
>
> If it's a map value, then you get all the goodness of normal map
> inspection.

Yup, exactly, so why bother with struct_ops to implement all the other
qdisc ops (apart from enqueue/dequeue)?
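
(For readers skimming the archive: a rough sketch of what that
two-program model could look like. Everything below is illustrative;
the context struct, section names and return codes are assumptions for
exposition, not the actual API from the RFC patch.)

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Hypothetical context handed to the qdisc programs. */
  struct sch_bpf_ctx {
          struct __sk_buff *skb;
          __u32 classid;
  };

  /* Hypothetical verdicts. */
  enum { SCH_BPF_QUEUED, SCH_BPF_DEQUEUED, SCH_BPF_DROP };

  SEC("sch_bpf/enqueue")            /* hypothetical section name */
  int trivial_enqueue(struct sch_bpf_ctx *ctx)
  {
          return SCH_BPF_QUEUED;    /* accept every packet */
  }

  SEC("sch_bpf/dequeue")
  int trivial_dequeue(struct sch_bpf_ctx *ctx)
  {
          return SCH_BPF_DEQUEUED;
  }

  char _license[] SEC("license") = "GPL";

That is the whole surface a programmer would fill in, versus the dozen
or so callbacks of struct Qdisc_ops plus struct Qdisc_class_ops.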

>> My hope is that we can (longer term) develop some higher-level tools to
>> express queueing policies that can then generate the BPF code needed to
>> implement them. Or as a start just some libraries to make this easier,
>> which I think is also what you're hinting at here? :)
>
> The P4 working group has thought about QoS and queuing from the P4 side,
> if you want to think in terms of a DSL. It might be interesting and have
> some benefits if you want to drop into the hardware offload side: for
> example, compile to XDP for fast CPU architectures, to an Altera/Xilinx
> backend for FPGAs, or to switch silicon for others. This was always the
> dream on my side; maybe we've finally got close to actualizing it, 10
> years later ;)

Yup, would love to see this! Let's just hope it doesn't take another
decade ;)

>> >> > Another idea. Rather than working with qdisc objects, which creates
>> >> > all these issues with existing interfaces, filters, etc., why not
>> >> > create an sk_buff map? Then this can be used from the existing
>> >> > egress/ingress hooks independent of the actual qdisc being used.
>> >> 
>> >> I agree. In fact, I'm working on doing just this for XDP, and I see no
>> >> reason why the map type couldn't be reused for skbs as well. Doing it
>> >> this way has a couple of benefits:
>> >> 
>> >> - It leaves more flexibility to BPF: want a simple FIFO queue? Just
>> >>   implement that with a single queue map. Or do you want to build a full
>> >>   hierarchical queueing structure? Just instantiate as many queue maps
>> >>   as you need to achieve this. Etc.
>> > Agree.  Regardless of how the interface may look,
>> > I even think being able to queue/dequeue an skb into different bpf maps
>> > should be the first thing to do here.  Looking forward to your patches.
>> 
>> Thanks! Guess I should go work on them, then :D
>
> Happy to review any RFCs.
>
>> 
>> >> - The behaviour is defined entirely by BPF program behaviour, and does
>> >>   not require setting up a qdisc hierarchy in addition to writing BPF
>> >>   code.
>> > Interesting idea.  If it does not need to use the qdisc object/interface
>> > and is able to do the qdisc hierarchy setup in a programmable way, it may
>> > be nice.  It will be useful for the future patches to come with some
>> > bpf prog examples that do that.
>> 
>> Absolutely; we plan to include example algorithm implementations as well!
>
> A weighted round-robin queue setup might be a useful example: easy to
> implement/understand, but slightly more interesting than a pfifo. It
> would also force an understanding of multi-CPU and timer issues.

Yup, some sort of RR queueing is definitely on the list!
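
To make the shape of that concrete, here is an untested sketch of a
deficit-style weighted round-robin dequeue (reusing the hypothetical
sch_bpf_ctx and verdicts from the sketch above). BPF_MAP_TYPE_SKB_QUEUE
and the bpf_skb_peek()/bpf_skb_dequeue() helpers are hypothetical
placeholders for the queue map being discussed; the scheduling logic is
the point:

  #define NUM_CLASSES 4

  /* per-class byte quantum added each round (i.e. the weights) */
  static const __s64 quantum[NUM_CLASSES] = { 1500, 3000, 4500, 6000 };

  struct wrr_state {
          __u32 cur;                  /* class currently being served */
          __s64 deficit[NUM_CLASSES]; /* DRR-style byte credit */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, struct wrr_state);
  } state SEC(".maps");

  /* Hypothetical map type: NUM_CLASSES sub-queues of skbs. */
  struct {
          __uint(type, BPF_MAP_TYPE_SKB_QUEUE);  /* does not exist yet */
          __uint(max_entries, NUM_CLASSES);
  } class_queues SEC(".maps");

  SEC("sch_bpf/dequeue")              /* hypothetical section name */
  int wrr_dequeue(struct sch_bpf_ctx *ctx)
  {
          __u32 zero = 0;
          struct wrr_state *st = bpf_map_lookup_elem(&state, &zero);
          int i;

          if (!st)
                  return SCH_BPF_DROP;

          /* visit each class at most once per invocation */
          for (i = 0; i < NUM_CLASSES; i++) {
                  __u32 c = st->cur;
                  /* hypothetical: look at the head without removing it */
                  struct __sk_buff *skb = bpf_skb_peek(&class_queues, c);

                  if (skb && st->deficit[c] >= skb->len) {
                          st->deficit[c] -= skb->len;
                          /* hypothetical: actually pop the head */
                          ctx->skb = bpf_skb_dequeue(&class_queues, c);
                          return SCH_BPF_DEQUEUED;
                  }

                  /* empty classes keep no credit; backlogged ones get
                   * their quantum and we move on to the next class */
                  st->deficit[c] = skb ? st->deficit[c] + quantum[c] : 0;
                  st->cur = (c + 1) % NUM_CLASSES;
          }
          return SCH_BPF_DROP;        /* nothing eligible this round */
  }

And the multi-CPU point is well taken: 'state' above would have to be
per-qdisc and lock-protected rather than a plain array map shared
between CPUs.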

>> >> - It should be possible to structure the hooks in a way that allows
>> >>   reusing queueing algorithm implementations between the qdisc and XDP
>> >>   layers.
>> >> 
>> >> > You mention the skb should not be exposed to userspace? Why? What's
>> >> > the reason for this? Anyway, we can make kernel-only maps if we want,
>> >> > or scrub the data before passing it to userspace. We do this already
>> >> > in some cases.
>> >> 
>> >> Yup, that's my approach as well.
>
> Having something reported back to userspace as the value might be helpful
> for debugging/tracing. Maybe the skb->hash? Then you could set this and
> track an skb through the stack even when it's in a bpf skb queue.

Yeah. I've just been using the pointer value for my initial testing.
That's not a good solution, of course, but having a visible identifier
would be neat. skb->hash makes sense for the qdisc layer, but not for
XDP...
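
(For the qdisc layer the hash is already reachable from tc-BPF with
existing helpers; a minimal sketch of the kind of tagging John
describes:)

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  /* Log a flow hash on egress so the same value can later identify
   * the packet while it sits in a BPF skb queue. */
  SEC("tc")
  int tag_for_tracking(struct __sk_buff *skb)
  {
          /* returns skb->hash, computing it first if not yet set */
          __u32 h = bpf_get_hash_recalc(skb);

          bpf_printk("skb hash %u len %u", h, skb->len);
          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";

At XDP time there is no skb yet, so an equivalent identifier would have
to be computed from the packet contents instead.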

>> >> 
>> >> > IMO it seems cleaner and more general to allow sk_buffs
>> >> > to be stored in maps and pulled back out later for enqueue/dequeue.
>> >> 
>> >> FWIW there are some gnarly details here (for instance, we need to make
>> >> sure the BPF program doesn't leak packet references after they are
>> >> dequeued from the map). My idea is to use a scheme similar to what we do
>> >> for XDP_REDIRECT, where a helper sets some hidden variables and doesn't
>> >> actually remove the packet from the queue until the BPF program exits
>> >> (so the kernel can make sure things are accounted correctly).
>> > The verifier is tracking the sk's references.  Can it be reused to
>> > track the skb's reference?
>> 
>> I was vaguely aware that it does this, but have not looked at the
>> details. Would be great if this was possible; will see how far I get
>> with it, and iterate from there (with your help, hopefully :))
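
(For context, the reference tracking Martin refers to: the verifier
already enforces acquire/release pairing for the socket lookup helpers,
so a dequeue helper returning a packet pointer could presumably get the
same treatment. A minimal illustration using the existing socket
helpers:)

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int refcount_demo(struct __sk_buff *skb)
  {
          struct bpf_sock_tuple tuple = {};
          struct bpf_sock *sk;

          /* acquires a reference that the verifier tracks */
          sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                 BPF_F_CURRENT_NETNS, 0);
          if (!sk)
                  return TC_ACT_OK;

          /* removing this release makes the program fail to load
           * with an "Unreleased reference" verifier error */
          bpf_sk_release(sk);
          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";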
>
> Also, we might need to drop any socket references from the networking
> side so an enqueued skb can't hold a socket open.

Not sure I'm following you here?

-Toke



Thread overview: 20+ messages
2021-08-21  1:02 [RFC Patch net-next] net_sched: introduce eBPF based Qdisc Cong Wang
2021-08-24 23:47 ` Martin KaFai Lau
2021-09-01  4:39   ` Cong Wang
2021-09-01  5:45     ` John Fastabend
2021-09-01 10:42       ` Toke Høiland-Jørgensen
2021-09-01 17:45         ` Martin KaFai Lau
2021-09-01 18:03           ` Alexei Starovoitov
2021-09-02 16:57           ` Toke Høiland-Jørgensen
2021-09-02 20:40             ` John Fastabend
2021-09-02 22:27               ` Toke Høiland-Jørgensen [this message]
2021-09-02 23:35                 ` Martin KaFai Lau
2021-09-03 14:44                   ` Toke Høiland-Jørgensen
2021-09-03 15:33                     ` Jamal Hadi Salim
2021-09-10  6:55                     ` Martin KaFai Lau
2021-09-10 11:31                       ` Toke Høiland-Jørgensen
2021-09-04  1:09           ` Cong Wang
2021-09-17  4:19             ` Martin KaFai Lau
2021-09-04  1:30         ` Cong Wang
2021-09-06 11:45           ` Toke Høiland-Jørgensen
2021-09-04  1:05       ` Cong Wang
