From: Jason Wang <jasowang@redhat.com>
To: Matt Cover <werekraken@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	kafai@fb.com, songliubraving@fb.com, yhs@fb.com,
	Eric Dumazet <edumazet@google.com>,
	Stanislav Fomichev <sdf@google.com>,
	Matthew Cover <matthew.cover@stackpath.com>,
	mail@timurcelik.de, pabeni@redhat.com,
	Nicolas Dichtel <nicolas.dichtel@6wind.com>,
	wangli39@baidu.com, lifei.shirley@bytedance.com,
	tglx@linutronix.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH net-next] tuntap: Fallback to automq on TUNSETSTEERINGEBPF prog negative return
Date: Mon, 23 Sep 2019 13:15:54 +0800
Message-ID: <b96ecf36-8f13-4a52-5355-7d88ec9e4a98@redhat.com>
In-Reply-To: <CAGyo_hondiOXi8GtqZg-YNV3A+COV=5PMHoNKaHbBjnTRTUe9Q@mail.gmail.com>


On 2019/9/23 11:18 AM, Matt Cover wrote:
> On Sun, Sep 22, 2019 at 7:34 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2019/9/23 9:15 AM, Matt Cover wrote:
>>> On Sun, Sep 22, 2019 at 5:51 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2019/9/23 6:30 AM, Matt Cover wrote:
>>>>> On Sun, Sep 22, 2019 at 1:36 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>> On Sun, Sep 22, 2019 at 10:43:19AM -0700, Matt Cover wrote:
>>>>>>> On Sun, Sep 22, 2019 at 5:37 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>>>>>>> On Fri, Sep 20, 2019 at 11:58:43AM -0700, Matthew Cover wrote:
>>>>>>>>> Treat a negative return from a TUNSETSTEERINGEBPF bpf prog as a signal
>>>>>>>>> to fall back to tun_automq_select_queue() for tx queue selection.
>>>>>>>>>
>>>>>>>>> Compilation of this exact patch was tested.
>>>>>>>>>
>>>>>>>>> For functional testing 3 additional printk()s were added.
>>>>>>>>>
>>>>>>>>> Functional testing results (on 2 txq tap device):
>>>>>>>>>
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] ========== tun no prog ==========
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] ========== tun prog -1 ==========
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '-1'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '-1'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_automq_select_queue() ran
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] ========== tun prog 0 ==========
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '0'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] ========== tun prog 1 ==========
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '1'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '1'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] ========== tun prog 2 ==========
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: bpf_prog_run_clear_cb() returned '2'
>>>>>>>>>      [Fri Sep 20 18:33:27 2019] tuntap: tun_ebpf_select_queue() returned '0'
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Cover <matthew.cover@stackpath.com>
>>>>>>>> Could you add a bit more motivation data here?
>>>>>>> Thank you for these questions Michael.
>>>>>>>
>>>>>>> I'll plan on adding the below information to the
>>>>>>> commit message and submitting a v2 of this patch
>>>>>>> when net-next reopens. In the meantime, it would
>>>>>>> be very helpful to know if these answers address
>>>>>>> some of your concerns.
>>>>>>>
>>>>>>>> 1. why is this a good idea
>>>>>>> This change allows TUNSETSTEERINGEBPF progs to
>>>>>>> do any of the following.
>>>>>>>     1. implement queue selection for a subset of
>>>>>>>        traffic (e.g. special queue selection logic
>>>>>>>        for ipv4, but return negative and use the
>>>>>>>        default automq logic for ipv6)
>>>>>>>     2. determine there isn't sufficient information
>>>>>>>        to do proper queue selection; return
>>>>>>>        negative and use the default automq logic
>>>>>>>        for the unknown
>>>>>>>     3. implement a noop prog (e.g. do
>>>>>>>        bpf_trace_printk() then return negative and
>>>>>>>        use the default automq logic for everything)
>>>>>>>
>>>>>>>> 2. how do we know existing userspace does not rely on existing behaviour
>>>>>>> Prior to this change a negative return from a
>>>>>>> TUNSETSTEERINGEBPF prog would have been cast
>>>>>>> into a u16 and traversed netdev_cap_txqueue().
>>>>>>>
>>>>>>> In most cases netdev_cap_txqueue() would have
>>>>>>> found this value to exceed real_num_tx_queues
>>>>>>> and queue_index would be updated to 0.
>>>>>>>
>>>>>>> It is possible that a TUNSETSTEERINGEBPF prog
>>>>>>> returns a negative value which, when cast into a
>>>>>>> u16, results in a positive queue_index less than
>>>>>>> real_num_tx_queues. For example, on x86_64, a
>>>>>>> return value of -65535 results in a queue_index
>>>>>>> of 1, which is a valid queue for any multiqueue
>>>>>>> device.
>>>>>>>
>>>>>>> It seems unlikely, though as stated above it is
>>>>>>> unfortunately possible, that existing
>>>>>>> TUNSETSTEERINGEBPF programs would choose to
>>>>>>> return a negative value rather than return the
>>>>>>> positive value which holds the same meaning.
>>>>>>>
>>>>>>> It seems more likely that future
>>>>>>> TUNSETSTEERINGEBPF programs would leverage a
>>>>>>> negative return and potentially be loaded into
>>>>>>> a kernel with the old behavior.
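
To make the truncation concrete, here is a minimal userspace sketch of the
arithmetic. It models only the two removed lines in the diff below (a u16
ret, then ret % numqueues). old_queue_index() is a hypothetical name, not a
kernel function, and netdev_cap_txqueue() is deliberately not modelled.

```c
#include <stdint.h>

/*
 * Hypothetical model of the pre-patch tun_ebpf_select_queue() arithmetic:
 * the prog's 32-bit result is stored in a u16, then reduced modulo the
 * number of queues.
 */
static uint16_t old_queue_index(int32_t prog_ret, uint32_t numqueues)
{
	uint16_t ret;

	if (!numqueues)
		return 0;

	/* Conversion to an unsigned 16-bit type keeps only the low 16 bits. */
	ret = (uint16_t)prog_ret;

	return (uint16_t)(ret % numqueues);
}
```

Under these assumptions, old_queue_index(-65535, 2) gives 1, matching the
example above. old_queue_index(-1, 2) also gives 1, since -1 truncates to
65535 and 65535 is odd.
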
>>>>>> OK if we are returning a special
>>>>>> value, shouldn't we limit it? How about a special
>>>>>> value with this meaning?
>>>>>> If we are changing an ABI let's at least make it
>>>>>> extensible.
>>>>>>
>>>>> A special value with this meaning sounds
>>>>> good to me. I'll plan on adding a define
>>>>> set to -1 to cause the fallback to automq.
>>>> Can it really return -1?
>>>>
>>>> I see:
>>>>
>>>> static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>>>>                                            struct sk_buff *skb)
>>>> ...
>>>>
>>>>
>>>>> The way I was initially viewing the old
>>>>> behavior was that returning negative was
>>>>> undefined; it happened to have the
>>>>> outcomes I walked through, but not
>>>>> necessarily by design.
>>>> Having such a fallback may bring extra trouble: it requires the eBPF
>>>> program to know about a behavior which is not actually part of the
>>>> kernel ABI. Some eBPF programs may then start to rely on it, which is
>>>> pretty dangerous. Note that one important consideration is macvtap
>>>> support, and macvtap does not have anything like automq.
>>>>
>>>> Thanks
>>>>
>>> How about we call this TUN_SSE_ABORT
>>> instead of TUN_SSE_DO_AUTOMQ?
>>>
>>> TUN_SSE_ABORT could be documented as
>>> falling back to the default queue
>>> selection method in either space
>>> (presumably macvtap has some queue
>>> selection method when there is no prog).
>>
>> This looks like a more complex API; we don't want userspace to have to
>> distinguish macvtap from tap too much.
>>
>> Thanks
>>
> This is barely more complex and provides
> something similar to what is done in many places.
> For xdp, an XDP_PASS enacts what the
> kernel would do if there was no bpf prog.
> For tc cls in da mode, TC_ACT_OK enacts
> what the kernel would do if there was
> no bpf prog. For xt_bpf, false enacts
> what the kernel would do if there was
> no bpf prog (as long as negation
> isn't in play in the rule, I believe).


I think this is simply because you can't implement e.g. XDP_PASS/TC_ACT_OK
through eBPF itself, which is not the case for the steering prog here.
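
As a rough illustration of that point, the sketch below is a plain userspace
C model, not real eBPF, of a steering function that handles its own default
case instead of asking the kernel to fall back. Everything in it is an
assumption made for illustration: struct pkt_model, steer_model() and
MODEL_ETH_P_IP are hypothetical names, and the hash-modulo default is only a
stand-in, not a reimplementation of tun_automq_select_queue().

```c
#include <stdint.h>

#define MODEL_ETH_P_IP 0x0800	/* EtherType for IPv4 */

/* Hypothetical packet summary; a real prog would read these from the skb. */
struct pkt_model {
	uint16_t eth_proto;	/* EtherType in host byte order */
	uint32_t flow_hash;	/* some per-flow hash value */
};

/*
 * Always returns a valid queue index. IPv4 gets special handling (pinned to
 * queue 0 here); everything else takes the prog's own default path, so no
 * negative "fall back to the kernel" return value is needed.
 */
static uint32_t steer_model(const struct pkt_model *p, uint32_t numqueues)
{
	if (!numqueues)
		return 0;

	if (p->eth_proto == MODEL_ETH_P_IP)
		return 0;

	return p->flow_hash % numqueues;
}
```

With four queues, an IPv4 packet maps to queue 0, while a non-IPv4 packet
with flow_hash 10 maps to queue 2.
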


>
> I know that this is somewhat of an
> oversimplification and that each of
> these also means something else in
> the respective hookpoint, but I stand by
> seeing value in this change.
>
> macvtap must have some default (i.e. the
> action which it takes when no prog is
> loaded), even if that is just use queue
> 0. We can provide the same TUN_SSE_ABORT
> in userspace which does the same thing;
> enacts the default when returned. Any
> differences left between tap and macvtap
> would be in what the default is, not in
> these changes. And that difference already
> exists today.


I think it's safer to just drop the packet instead of trying to
work around it.

Thanks


>
>>>>> In order to keep the new behavior
>>>>> extensible, how should we state that a
>>>>> negative return other than -1 is
>>>>> undefined and therefore subject to
>>>>> change. Is something like this
>>>>> sufficient?
>>>>>
>>>>>      Documentation/networking/tc-actions-env-rules.txt
>>>>>
>>>>> Additionally, what should the new
>>>>> behavior implement when a negative other
>>>>> than -1 is returned? I would like to have
>>>>> it do the same thing as -1 for now, but
>>>>> with the understanding that this behavior
>>>>> is undefined. Does this sound reasonable?
>>>>>
>>>>>>>> 3. why doesn't userspace need a way to figure out whether it runs on a kernel with and
>>>>>>>>       without this patch
>>>>>>> There may be some value in exposing this fact
>>>>>>> to the ebpf prog loader. What is the standard
>>>>>>> practice here, a define?
>>>>>> We'll need something at runtime - people move binaries between kernels
>>>>>> without rebuilding them. An ioctl is one option.
>>>>>> A sysfs attribute is another, an ethtool flag yet another.
>>>>>> A combination of these is possible.
>>>>>>
>>>>>> And if we are doing this anyway, maybe let userspace select
>>>>>> the new behaviour? This way we can stay compatible with old
>>>>>> userspace...
>>>>>>
>>>>> Understood. I'll look into adding an
>>>>> ioctl to activate the new behavior. And
>>>>> perhaps a method of checking which
>>>>> behavior is currently active (in case we
>>>>> ever want to change the default, say
>>>>> after some suitably long transition
>>>>> period).
>>>>>
>>>>>>>> thanks,
>>>>>>>> MST
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>     drivers/net/tun.c | 20 +++++++++++---------
>>>>>>>>>     1 file changed, 11 insertions(+), 9 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>>>>>>> index aab0be4..173d159 100644
>>>>>>>>> --- a/drivers/net/tun.c
>>>>>>>>> +++ b/drivers/net/tun.c
>>>>>>>>> @@ -583,35 +583,37 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>>          return txq;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>> +static int tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
>>>>>>>>>     {
>>>>>>>>>          struct tun_prog *prog;
>>>>>>>>>          u32 numqueues;
>>>>>>>>> -     u16 ret = 0;
>>>>>>>>> +     int ret = -1;
>>>>>>>>>
>>>>>>>>>          numqueues = READ_ONCE(tun->numqueues);
>>>>>>>>>          if (!numqueues)
>>>>>>>>>                  return 0;
>>>>>>>>>
>>>>>>>>> +     rcu_read_lock();
>>>>>>>>>          prog = rcu_dereference(tun->steering_prog);
>>>>>>>>>          if (prog)
>>>>>>>>>                  ret = bpf_prog_run_clear_cb(prog->prog, skb);
>>>>>>>>> +     rcu_read_unlock();
>>>>>>>>>
>>>>>>>>> -     return ret % numqueues;
>>>>>>>>> +     if (ret >= 0)
>>>>>>>>> +             ret %= numqueues;
>>>>>>>>> +
>>>>>>>>> +     return ret;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
>>>>>>>>>                              struct net_device *sb_dev)
>>>>>>>>>     {
>>>>>>>>>          struct tun_struct *tun = netdev_priv(dev);
>>>>>>>>> -     u16 ret;
>>>>>>>>> +     int ret;
>>>>>>>>>
>>>>>>>>> -     rcu_read_lock();
>>>>>>>>> -     if (rcu_dereference(tun->steering_prog))
>>>>>>>>> -             ret = tun_ebpf_select_queue(tun, skb);
>>>>>>>>> -     else
>>>>>>>>> +     ret = tun_ebpf_select_queue(tun, skb);
>>>>>>>>> +     if (ret < 0)
>>>>>>>>>                  ret = tun_automq_select_queue(tun, skb);
>>>>>>>>> -     rcu_read_unlock();
>>>>>>>>>
>>>>>>>>>          return ret;
>>>>>>>>>     }
>>>>>>>>> --
>>>>>>>>> 1.8.3.1