* [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
@ 2019-02-19 10:24 Jesper Dangaard Brouer
  2019-02-19 11:46 ` Daniel Borkmann
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2019-02-19 10:24 UTC (permalink / raw)
  To: netdev, Daniel Borkmann, Alexei Starovoitov; +Cc: Jesper Dangaard Brouer

The skb->queue_mapping field already has read access, via __sk_buff->queue_mapping.

This patch allows tc clsact BPF programs write access to queue_mapping, via
tc_cls_act_is_valid_access.

It is already possible to change this via the TC filter action skbedit,
see tc-skbedit(8).  Due to the lack of TC examples, let's show one:

 # tc qdisc  add  dev ixgbe1 handle ffff: ingress
 # tc filter add  dev ixgbe1 parent ffff: matchall action skbedit queue_mapping 5
 # tc filter list dev ixgbe1 parent ffff:

The most common mistake is overlooking that XPS (Transmit Packet Steering)
takes precedence over setting skb->queue_mapping. XPS is configured per
device via /sys/class/net/DEVICE/queues/tx-*/xps_cpus using a CPU hex mask.
To disable XPS, set the mask to 00.
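As a sketch, clearing the XPS mask on every TX queue could look like the
following (the device name "eth0" is a placeholder; adjust for your NIC):

```shell
# Sketch: clear the XPS CPU mask on every TX queue of a device, so
# that a queue_mapping set via skbedit/BPF is not overridden by XPS.
# "eth0" is a placeholder device name.
dev=eth0
for f in /sys/class/net/"$dev"/queues/tx-*/xps_cpus; do
    [ -w "$f" ] || continue  # skip if the device or queue is absent
    echo 00 > "$f"
done
```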

The purpose of changing skb->queue_mapping is to influence the selection of
the net_device "txq" (struct netdev_queue), which in turn influences the
selection of the qdisc "root_lock" (via txq->qdisc->q.lock) and of
txq->_xmit_lock. When using the MQ qdisc, each txq->qdisc points to a
separate qdisc with its own locks, including HARD_TX_LOCK (txq->_xmit_lock),
allowing for CPU scalability.

Due to the lack of TC examples, let's show how to attach clsact BPF programs:

 # tc qdisc  add     dev ixgbe2 clsact
 # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
 # tc filter list    dev ixgbe2 egress
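The section name tc_qmap2cpu above suggests a program along the following
lines. This is only an illustrative sketch (the program body, function name,
and per-CPU mapping policy are assumptions, not the actual XXX_kern.o source);
it would be built with clang -O2 -target bpf -c:

```c
/* Sketch of a tc clsact egress program for the "tc_qmap2cpu" section:
 * steer each packet to the TX queue matching the transmitting CPU.
 * The store to skb->queue_mapping is exactly what this patch makes
 * the verifier accept.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#define __section(x) __attribute__((section(x), used))

static unsigned int (*bpf_get_smp_processor_id)(void) =
	(void *) BPF_FUNC_get_smp_processor_id;

__section("tc_qmap2cpu")
int tc_qmap2cpu_prog(struct __sk_buff *skb)
{
	/* Newly writable with this patch; previously read-only. */
	skb->queue_mapping = bpf_get_smp_processor_id();
	return TC_ACT_OK;
}

char __license[] __section("license") = "GPL";
```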

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 353735575204..d05ae8d05397 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6238,6 +6238,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 		case bpf_ctx_range(struct __sk_buff, tc_classid):
 		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
 		case bpf_ctx_range(struct __sk_buff, tstamp):
+		case bpf_ctx_range(struct __sk_buff, queue_mapping):
 			break;
 		default:
 			return false;
@@ -6642,9 +6643,16 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		break;
 
 	case offsetof(struct __sk_buff, queue_mapping):
-		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
-				      bpf_target_off(struct sk_buff, queue_mapping, 2,
-						     target_size));
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
+					      bpf_target_off(struct sk_buff,
+							     queue_mapping,
+							     2, target_size));
+		else
+			*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+					      bpf_target_off(struct sk_buff,
+							     queue_mapping,
+							     2, target_size));
 		break;
 
 	case offsetof(struct __sk_buff, vlan_present):


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 10:24 [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact Jesper Dangaard Brouer
@ 2019-02-19 11:46 ` Daniel Borkmann
  2019-02-19 14:52   ` Jesper Dangaard Brouer
  2019-02-19 15:57   ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Borkmann @ 2019-02-19 11:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann, Alexei Starovoitov

On 02/19/2019 11:24 AM, Jesper Dangaard Brouer wrote:
> The skb->queue_mapping already have read access, via __sk_buff->queue_mapping.
> 
> This patch allow BPF tc qdisc clsact write access to the queue_mapping via
> tc_cls_act_is_valid_access.
> 
> It is already possible to change this via TC filter action skbedit
> tc-skbedit(8).  Due to the lack of TC examples, lets show one:
> 
>  # tc qdisc  add  dev ixgbe1 handle ffff: ingress
>  # tc filter add  dev ixgbe1 parent ffff: matchall action skbedit queue_mapping 5
>  # tc filter list dev ixgbe1 parent ffff:

Using handles is from the old days; if we add examples, then let's do
something more user-friendly ;)

  # tc qdisc  add     dev ixgbe1 clsact
  # tc filter replace dev ixgbe1 ingress matchall action skbedit queue_mapping 5
  # tc filter list    dev ixgbe1 ingress

> The most common mistake is that XPS (Transmit Packet Steering) takes
> precedence over setting skb->queue_mapping. XPS is configured per DEVICE
> via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To
> disable set mask=00.
> 
> The purpose of changing skb->queue_mapping is to influence the selection of
> the net_device "txq" (struct netdev_queue), which influence selection of
> the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When
> using the MQ qdisc the txq->qdisc points to different qdiscs and associated
> locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability.
> 
> Due to lack of TC examples, lets show howto attach clsact BPF programs:
> 
>  # tc qdisc  add     dev ixgbe2 clsact
>  # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
>  # tc filter list    dev ixgbe2 egress
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/filter.c |   14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 353735575204..d05ae8d05397 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6238,6 +6238,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>  		case bpf_ctx_range(struct __sk_buff, tc_classid):
>  		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
>  		case bpf_ctx_range(struct __sk_buff, tstamp):
> +		case bpf_ctx_range(struct __sk_buff, queue_mapping):
>  			break;
>  		default:
>  			return false;
> @@ -6642,9 +6643,16 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>  		break;
>  
>  	case offsetof(struct __sk_buff, queue_mapping):
> -		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> -				      bpf_target_off(struct sk_buff, queue_mapping, 2,
> -						     target_size));
> +		if (type == BPF_WRITE)
> +			*insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +					      bpf_target_off(struct sk_buff,
> +							     queue_mapping,
> +							     2, target_size));
> +		else
> +			*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> +					      bpf_target_off(struct sk_buff,
> +							     queue_mapping,
> +							     2, target_size));

One thing we should avoid is allowing the user to write NO_QUEUE_MAPPING
into skb->queue_mapping, so that we don't hit the WARN in sk_tx_queue_set().
I'd add this check into the ctx rewrite here.

>  		break;
>  
>  	case offsetof(struct __sk_buff, vlan_present):
> 



* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 11:46 ` Daniel Borkmann
@ 2019-02-19 14:52   ` Jesper Dangaard Brouer
  2019-02-19 16:18     ` Daniel Borkmann
  2019-02-19 15:57   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2019-02-19 14:52 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Daniel Borkmann, Alexei Starovoitov, brouer

On Tue, 19 Feb 2019 12:46:57 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 02/19/2019 11:24 AM, Jesper Dangaard Brouer wrote:
> > The skb->queue_mapping already have read access, via __sk_buff->queue_mapping.
> > 
> > This patch allow BPF tc qdisc clsact write access to the queue_mapping via
> > tc_cls_act_is_valid_access.
> > 
> > It is already possible to change this via TC filter action skbedit
> > tc-skbedit(8).  Due to the lack of TC examples, lets show one:
> > 
> >  # tc qdisc  add  dev ixgbe1 handle ffff: ingress
> >  # tc filter add  dev ixgbe1 parent ffff: matchall action skbedit queue_mapping 5
> >  # tc filter list dev ixgbe1 parent ffff:  
> 
> Using handles was in the old days, if we add examples, then lets do
> something more user friendly ;)
> 
>   # tc qdisc  add     dev ixgbe1 clsact
>   # tc filter replace dev ixgbe1 ingress matchall action skbedit queue_mapping 5
>   # tc filter list    dev ixgbe1 ingress
> 
> > The most common mistake is that XPS (Transmit Packet Steering) takes
> > precedence over setting skb->queue_mapping. XPS is configured per DEVICE
> > via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To
> > disable set mask=00.
> > 
> > The purpose of changing skb->queue_mapping is to influence the selection of
> > the net_device "txq" (struct netdev_queue), which influence selection of
> > the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When
> > using the MQ qdisc the txq->qdisc points to different qdiscs and associated
> > locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability.
> > 
> > Due to lack of TC examples, lets show howto attach clsact BPF programs:
> > 
> >  # tc qdisc  add     dev ixgbe2 clsact
> >  # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
> >  # tc filter list    dev ixgbe2 egress
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  net/core/filter.c |   14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 353735575204..d05ae8d05397 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6238,6 +6238,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
> >  		case bpf_ctx_range(struct __sk_buff, tc_classid):
> >  		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
> >  		case bpf_ctx_range(struct __sk_buff, tstamp):
> > +		case bpf_ctx_range(struct __sk_buff, queue_mapping):
> >  			break;
> >  		default:
> >  			return false;
> > @@ -6642,9 +6643,16 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> >  		break;
> >  
> >  	case offsetof(struct __sk_buff, queue_mapping):
> > -		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> > -				      bpf_target_off(struct sk_buff, queue_mapping, 2,
> > -						     target_size));
> > +		if (type == BPF_WRITE)
> > +			*insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
> > +					      bpf_target_off(struct sk_buff,
> > +							     queue_mapping,
> > +							     2, target_size));
> > +		else
> > +			*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
> > +					      bpf_target_off(struct sk_buff,
> > +							     queue_mapping,
> > +							     2, target_size));  
> 
> One thing we should avoid would be to allow user to write NO_QUEUE_MAPPING
> into skb->queue_mapping so we don't hit the warn in sk_tx_queue_set(), I'd
> add this into the ctx rewrite here.

Makes sense. I would really appreciate if you could help me out writing
the needed BPF instructions, as I'm not an expert here.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 11:46 ` Daniel Borkmann
  2019-02-19 14:52   ` Jesper Dangaard Brouer
@ 2019-02-19 15:57   ` Jesper Dangaard Brouer
  2019-02-19 16:08     ` Daniel Borkmann
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2019-02-19 15:57 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Daniel Borkmann, Alexei Starovoitov, brouer

On Tue, 19 Feb 2019 12:46:57 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> > Due to lack of TC examples, lets show howto attach clsact BPF programs:
> > 
> >  # tc qdisc  add     dev ixgbe2 clsact
> >  # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
> >  # tc filter list    dev ixgbe2 egress

Recommending "replace" seems wrong, as it does not replace the existing
filter, but keeps adding more filter entries.

What is the recommended procedure for unloading and loading a newer
version of the BPF TC program?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 15:57   ` Jesper Dangaard Brouer
@ 2019-02-19 16:08     ` Daniel Borkmann
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2019-02-19 16:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Daniel Borkmann, Alexei Starovoitov

On 02/19/2019 04:57 PM, Jesper Dangaard Brouer wrote:
> On Tue, 19 Feb 2019 12:46:57 +0100
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>>> Due to lack of TC examples, lets show howto attach clsact BPF programs:
>>>
>>>  # tc qdisc  add     dev ixgbe2 clsact
>>>  # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
>>>  # tc filter list    dev ixgbe2 egress
> 
> Recommending the "replace" is wrong is seems, as does not replace the
> existing, but keeps adding more filter entries.
> 
> What is the recommended procedure for unloading and loading a newer
> version of the BPF TC program?

You would need to specify prio / handle in order to select a particular
instance for atomic replacement:

tc filter replace dev foo {e,in}gress prio 1 handle 1 bpf da obj foo.o


* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 14:52   ` Jesper Dangaard Brouer
@ 2019-02-19 16:18     ` Daniel Borkmann
  2019-02-19 19:36       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2019-02-19 16:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Daniel Borkmann, Alexei Starovoitov

On 02/19/2019 03:52 PM, Jesper Dangaard Brouer wrote:
> On Tue, 19 Feb 2019 12:46:57 +0100
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>> On 02/19/2019 11:24 AM, Jesper Dangaard Brouer wrote:
>>> The skb->queue_mapping already have read access, via __sk_buff->queue_mapping.
>>>
>>> This patch allow BPF tc qdisc clsact write access to the queue_mapping via
>>> tc_cls_act_is_valid_access.
>>>
>>> It is already possible to change this via TC filter action skbedit
>>> tc-skbedit(8).  Due to the lack of TC examples, lets show one:
>>>
>>>  # tc qdisc  add  dev ixgbe1 handle ffff: ingress
>>>  # tc filter add  dev ixgbe1 parent ffff: matchall action skbedit queue_mapping 5
>>>  # tc filter list dev ixgbe1 parent ffff:  
>>
>> Using handles was in the old days, if we add examples, then lets do
>> something more user friendly ;)
>>
>>   # tc qdisc  add     dev ixgbe1 clsact
>>   # tc filter replace dev ixgbe1 ingress matchall action skbedit queue_mapping 5
>>   # tc filter list    dev ixgbe1 ingress
>>
>>> The most common mistake is that XPS (Transmit Packet Steering) takes
>>> precedence over setting skb->queue_mapping. XPS is configured per DEVICE
>>> via /sys/class/net/DEVICE/queues/tx-*/xps_cpus via a CPU hex mask. To
>>> disable set mask=00.
>>>
>>> The purpose of changing skb->queue_mapping is to influence the selection of
>>> the net_device "txq" (struct netdev_queue), which influence selection of
>>> the qdisc "root_lock" (via txq->qdisc->q.lock) and txq->_xmit_lock. When
>>> using the MQ qdisc the txq->qdisc points to different qdiscs and associated
>>> locks, and HARD_TX_LOCK (txq->_xmit_lock), allowing for CPU scalability.
>>>
>>> Due to lack of TC examples, lets show howto attach clsact BPF programs:
>>>
>>>  # tc qdisc  add     dev ixgbe2 clsact
>>>  # tc filter replace dev ixgbe2 egress bpf da obj XXX_kern.o sec tc_qmap2cpu
>>>  # tc filter list    dev ixgbe2 egress
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>  net/core/filter.c |   14 +++++++++++---
>>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 353735575204..d05ae8d05397 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -6238,6 +6238,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>>>  		case bpf_ctx_range(struct __sk_buff, tc_classid):
>>>  		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
>>>  		case bpf_ctx_range(struct __sk_buff, tstamp):
>>> +		case bpf_ctx_range(struct __sk_buff, queue_mapping):
>>>  			break;
>>>  		default:
>>>  			return false;
>>> @@ -6642,9 +6643,16 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>>>  		break;
>>>  
>>>  	case offsetof(struct __sk_buff, queue_mapping):
>>> -		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>>> -				      bpf_target_off(struct sk_buff, queue_mapping, 2,
>>> -						     target_size));
>>> +		if (type == BPF_WRITE)
>>> +			*insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
>>> +					      bpf_target_off(struct sk_buff,
>>> +							     queue_mapping,
>>> +							     2, target_size));
>>> +		else
>>> +			*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>>> +					      bpf_target_off(struct sk_buff,
>>> +							     queue_mapping,
>>> +							     2, target_size));  
>>
>> One thing we should avoid would be to allow user to write NO_QUEUE_MAPPING
>> into skb->queue_mapping so we don't hit the warn in sk_tx_queue_set(), I'd
>> add this into the ctx rewrite here.
> 
> Makes sense. I would really appreciate if you could help me out writing
> the needed BPF instructions, as I'm not an expert here.

Untested / uncompiled, but should be:

        case offsetof(struct __sk_buff, queue_mapping):
                if (type == BPF_WRITE) {
                        /* Skip the store below (jump over 1 insn) if the
                         * value is out of range, i.e. >= NO_QUEUE_MAPPING.
                         */
                        *insn++ = BPF_JMP_IMM(BPF_JGE, si->src_reg, NO_QUEUE_MAPPING, 1);
                        *insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
                                              bpf_target_off(struct sk_buff, queue_mapping, 2,
                                                             target_size));
                } else {
                        *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
                                              bpf_target_off(struct sk_buff, queue_mapping, 2,
                                                             target_size));
                }
                break;


* Re: [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact
  2019-02-19 16:18     ` Daniel Borkmann
@ 2019-02-19 19:36       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2019-02-19 19:36 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Daniel Borkmann, Alexei Starovoitov, brouer

On Tue, 19 Feb 2019 17:18:30 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Untested / uncompiled, but should be:
> 
>         case offsetof(struct __sk_buff, queue_mapping):
>                 if (type == BPF_WRITE) {
>                         *insn++ = BPF_JMP_IMM(BPF_JGE, si->src_reg, NO_QUEUE_MAPPING, 1);
>                         *insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg,
>                                               bpf_target_off(struct sk_buff, queue_mapping, 2,
>                                                              target_size));
>                 } else {
>                         *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
>                                               bpf_target_off(struct sk_buff, queue_mapping, 2,
>                                                              target_size));
>                 }
>                 break;

Incorporated in V2 and tested.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


end of thread, other threads:[~2019-02-19 19:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
2019-02-19 10:24 [PATCH bpf-next] bpf: add skb->queue_mapping write access from tc clsact Jesper Dangaard Brouer
2019-02-19 11:46 ` Daniel Borkmann
2019-02-19 14:52   ` Jesper Dangaard Brouer
2019-02-19 16:18     ` Daniel Borkmann
2019-02-19 19:36       ` Jesper Dangaard Brouer
2019-02-19 15:57   ` Jesper Dangaard Brouer
2019-02-19 16:08     ` Daniel Borkmann
