From: Daniel Borkmann <daniel@iogearbox.net>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: ast@kernel.org, john.fastabend@gmail.com, yhs@fb.com,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH bpf-next v6 2/6] bpf: add redirect_peer helper
Date: Sun, 11 Oct 2020 19:16:16 +0200
Message-ID: <aadbb662-bb42-05be-0943-d59ba0d3f60c@iogearbox.net>
In-Reply-To: <20201011112213.7e542de7@carbon>

On 10/11/20 11:22 AM, Jesper Dangaard Brouer wrote:
> On Sun, 11 Oct 2020 01:40:02 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>> Add an efficient ingress-to-ingress netns switch that can be used out of tc BPF
>> programs in order to redirect traffic from host ns ingress into a container
>> veth device ingress, without having to go via the CPU backlog queue [0]. For
>> local containers this can also be utilized, and the path via the CPU backlog
>> queue then only needs to be taken once, not twice. On a high level this borrows
>> from ipvlan, which does a similar switch in __netif_receive_skb_core() and then
>> iterates via another_round. This helps to reduce latency for the mentioned use
>> cases.
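
For readers following along, usage from a tc BPF program looks roughly like
this; a minimal sketch, with the ifindex value and section name purely
illustrative rather than taken from the patch:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Host-visible ifindex of the target container's veth device;
   * placeholder value for illustration only.
   */
  #define POD_VETH_IFINDEX	42

  SEC("tc")
  int redirect_to_pod(struct __sk_buff *skb)
  {
  	/* Switch the skb straight into the veth peer's netns ingress,
  	 * bypassing the CPU backlog queue. Flags must currently be 0.
  	 */
  	return bpf_redirect_peer(POD_VETH_IFINDEX, 0);
  }

  char LICENSE[] SEC("license") = "GPL";
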
>>
>> Pod to remote pod with redirect(), TCP_RR [1]:
>>
>>    # percpu_netperf 10.217.1.33
>>            RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
>>          MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
>>        STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
>>           MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
>>           P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
>>           P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
>>           P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
>>
>>      TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
>>
>> Pod to remote pod with redirect_peer(), TCP_RR:
>>
>>    # percpu_netperf 10.217.1.33
>>            RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
>>          MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
>>        STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
>>           MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
>>           P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
>>           P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
>>           P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
>>
>>      TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
> 
> These are awesome results, and great work Daniel! :-)
> 
> I wonder if we can also support this from XDP, which can also natively
> redirect into veth.  Originally I thought we could add the peer netdev
> to the devmap, but AFAIK Toke showed me that this was not possible.

I think it should be possible with a similar principle. What was the limitation
that you ran into with devmap for XDP?

>>    [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
>>    [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>>   drivers/net/veth.c             |  9 ++++++
>>   include/linux/netdevice.h      |  4 +++
>>   include/uapi/linux/bpf.h       | 17 +++++++++++
>>   net/core/dev.c                 | 15 ++++++++--
>>   net/core/filter.c              | 54 +++++++++++++++++++++++++++++-----
>>   tools/include/uapi/linux/bpf.h | 17 +++++++++++
>>   6 files changed, 106 insertions(+), 10 deletions(-)
>>
> [...]
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 9d55bf5d1a65..7dd015823593 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4930,7 +4930,7 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>>   
>>   static inline struct sk_buff *
>>   sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>> -		   struct net_device *orig_dev)
>> +		   struct net_device *orig_dev, bool *another)
>>   {
>>   #ifdef CONFIG_NET_CLS_ACT
>>   	struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
>> @@ -4974,7 +4974,11 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>>   		 * redirecting to another netdev
>>   		 */
>>   		__skb_push(skb, skb->mac_len);
>> -		skb_do_redirect(skb);
>> +		if (skb_do_redirect(skb) == -EAGAIN) {
>> +			__skb_pull(skb, skb->mac_len);
>> +			*another = true;
>> +			break;
>> +		}
>>   		return NULL;
>>   	case TC_ACT_CONSUMED:
>>   		return NULL;
>> @@ -5163,7 +5167,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
>>   skip_taps:
>>   #ifdef CONFIG_NET_INGRESS
>>   	if (static_branch_unlikely(&ingress_needed_key)) {
>> -		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
>> +		bool another = false;
>> +
>> +		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
>> +					 &another);
>> +		if (another)
>> +			goto another_round;
>>   		if (!skb)
>>   			goto out;
>>   
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 5da44b11e1ec..fab951c6be57 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2380,8 +2380,9 @@ static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev)
>>   
>>   /* Internal, non-exposed redirect flags. */
>>   enum {
>> -	BPF_F_NEIGH = (1ULL << 1),
>> -#define BPF_F_REDIRECT_INTERNAL	(BPF_F_NEIGH)
>> +	BPF_F_NEIGH	= (1ULL << 1),
>> +	BPF_F_PEER	= (1ULL << 2),
>> +#define BPF_F_REDIRECT_INTERNAL	(BPF_F_NEIGH | BPF_F_PEER)
>>   };
>>   
>>   BPF_CALL_3(bpf_clone_redirect, struct sk_buff *, skb, u32, ifindex, u64, flags)
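
The bpf_redirect_peer() helper itself is elided from the quoted hunks; a
rough sketch of the setter side, mirroring the existing bpf_redirect()
pattern (a sketch, not quoted from the patch):

  BPF_CALL_2(bpf_redirect_peer, u32, ifindex, u64, flags)
  {
  	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);

  	/* No user-visible flags for this helper in this sketch. */
  	if (unlikely(flags))
  		return TC_ACT_SHOT;

  	/* Stash the target and mark it as a peer redirect; skb_do_redirect()
  	 * below then resolves the veth peer and returns -EAGAIN to loop the
  	 * skb through __netif_receive_skb_core() again.
  	 */
  	ri->flags = BPF_F_PEER;
  	ri->tgt_index = ifindex;

  	return TC_ACT_REDIRECT;
  }
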
>> @@ -2430,19 +2431,35 @@ EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
>>   int skb_do_redirect(struct sk_buff *skb)
>>   {
>>   	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>> +	struct net *net = dev_net(skb->dev);
>>   	struct net_device *dev;
>>   	u32 flags = ri->flags;
>>   
>> -	dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index);
>> +	dev = dev_get_by_index_rcu(net, ri->tgt_index);
>>   	ri->tgt_index = 0;
>> -	if (unlikely(!dev)) {
>> -		kfree_skb(skb);
>> -		return -EINVAL;
>> +	ri->flags = 0;
>> +	if (unlikely(!dev))
>> +		goto out_drop;
>> +	if (flags & BPF_F_PEER) {
>> +		const struct net_device_ops *ops = dev->netdev_ops;
>> +
>> +		if (unlikely(!ops->ndo_get_peer_dev ||
>> +			     !skb_at_tc_ingress(skb)))
>> +			goto out_drop;
>> +		dev = ops->ndo_get_peer_dev(dev);
>> +		if (unlikely(!dev ||
>> +			     !is_skb_forwardable(dev, skb) ||
> 
> Again an MTU "transmit" check on the ingress "receive" path, but we can
> take that discussion after this is merged, as this keeps the current
> behavior.

Yep, agree; it also checks whether the dev is up, which we need here too.

>> +			     net_eq(net, dev_net(dev))))
>> +			goto out_drop;
>> +		skb->dev = dev;
> 
> Don't we need to clean some more state when this packet gets redirected
> into another namespace?
> 
> Like skb_scrub_packet(), or is that not needed?  (p.s. I would like to
> avoid it, as it e.g. clears the skb->mark.)

Not needed: the traffic egress path from a netns is already scrubbing, and for
ingress traffic in the host ns the BPF prog can do it if needed, given it has
full control. This is similar to how ipvlan does it for traffic into the
container.

>> +		return -EAGAIN;
>>   	}
>> -
>>   	return flags & BPF_F_NEIGH ?
>>   	       __bpf_redirect_neigh(skb, dev) :
>>   	       __bpf_redirect(skb, dev, flags);
>> +out_drop:
>> +	kfree_skb(skb);
>> +	return -EINVAL;
>>   }

