* Modification to skb->queue_mapping affecting performance
@ 2016-09-13 22:59 Michael Ma
  2016-09-13 23:07 ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-13 22:59 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Hi -

We currently use mqprio on ifb to work around the qdisc root lock
contention on the receiver side. The problem we found is that
queue_mapping is already set when redirecting from the ingress qdisc to
ifb (based on RX queue selection, I guess?), so the TX queue selection
is not based on priority.

Then we implemented a filter which sets skb->queue_mapping to 0 so
that TX queue selection can be done as expected and flows with
different priorities go through different TX queues. However, with
the queue_mapping recomputed, we found that the achievable bandwidth
with small packets (512 bytes) drops significantly if they target
different queues. From the perf profile I don't see any bottleneck
from a CPU perspective.
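
A filter of this shape (illustrative only - our actual filter is a
custom one, and the device names here are placeholders) sets
queue_mapping before the redirect:

# assumes an ingress qdisc is already installed on eth0 (handle ffff:)
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
   action skbedit queue_mapping 0 \
   action mirred egress redirect dev ifb0

Whether a value set this way survives the IFB TX queue selection
depends on the kernel's queue-pick logic, so treat it as a sketch of
the idea rather than the exact filter we use.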

Any thoughts on why modifying queue_mapping would have this kind of
effect? Also, is there a better way of achieving receiver-side
throttling with HTB while avoiding the qdisc root lock on ifb?

Thanks,
Michael


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-13 22:59 Modification to skb->queue_mapping affecting performance Michael Ma
@ 2016-09-13 23:07 ` Eric Dumazet
       [not found]   ` <CAAmHdhw6JXYtYNjcJchieW-KcwRkNnFmXTRTXrLQS_Kh+rfP2w@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-13 23:07 UTC (permalink / raw)
  To: Michael Ma; +Cc: Linux Kernel Network Developers

On Tue, 2016-09-13 at 15:59 -0700, Michael Ma wrote:
> Hi -
> 
> We currently use mqprio on ifb to work around the qdisc root lock
> contention on the receiver side. The problem that we found was that
> queue_mapping is already set when redirecting from ingress qdisc to
> ifb (based on RX selection, I guess?) so the TX queue selection is not
> based on priority.
> 
> Then we implemented a filter which can set skb->queue_mapping to 0 so
> that TX queue selection can be done as expected and flows with
> different priorities will go through different TX queues. However with
> the queue_mapping recomputed, we found the achievable bandwidth with
> small packets (512 bytes) dropped significantly if they're targeting
> different queues. From perf profile I don't see any bottleneck from
> CPU perspective.
> 
> Any thoughts on why modifying queue_mapping will have this kind of
> effect? Also is there any better way of achieving receiver side
> throttling using HTB while avoiding the qdisc root lock on ifb?

But how many queues do you have on your NIC, and have you set up ifb to
have the same number of queues ?

There is no qdisc lock contention anymore AFAIK, since each cpu will use
a dedicated IFB queue and tasklet.


* Re: Modification to skb->queue_mapping affecting performance
       [not found]   ` <CAAmHdhw6JXYtYNjcJchieW-KcwRkNnFmXTRTXrLQS_Kh+rfP2w@mail.gmail.com>
@ 2016-09-14  0:09     ` Eric Dumazet
  2016-09-14  0:23       ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-14  0:09 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev

On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:

> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
> 64. (is this the correct way of identifying the queue number on NIC?)
> I setup ifb with 24 queues which is equal to the TX queue number of
> eth0 and also the number of CPU cores.

Please do not drop netdev@ from this mail exchange.

ethtool -l eth0
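
That shows the RX/TX channel counts the driver exposes; for reference,
they can also be adjusted and cross-checked along these lines
(parameter names depend on the NIC driver - treat this as a sketch):

ethtool -L eth0 combined 24    # example only: set the channel count, if the driver exposes "combined" channels
ls /sys/class/net/eth0/queues  # one rx-*/tx-* directory per queue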

> 
> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
> > a dedicate IFB queue and tasklet.
> >
> How is this achieved? I thought qdisc on ifb will still be protected
> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
> processing qdisc are still serialized without using MQ?

You have to properly set up ifb/mq, like in:

# netem based setup, installed at receiver side only
ETH=eth0
IFB=ifb10
#DELAY="delay 100ms"
EST="est 1sec 4sec"
#REORDER=1000us
#LOSS="loss 2.0"
TXQ=24  # change this to number of TX queues on the physical NIC

ip link add $IFB numtxqueues $TXQ type ifb
ip link set dev $IFB up

tc qdisc del dev $ETH ingress 2>/dev/null
tc qdisc add dev $ETH ingress 2>/dev/null

tc filter add dev $ETH parent ffff: \
   protocol ip u32 match u32 0 0 flowid 1:1 \
	action mirred egress redirect dev $IFB

tc qdisc del dev $IFB root 2>/dev/null

tc qdisc add dev $IFB root handle 1: mq
for i in `seq 1 $TXQ`
do
 slot=$( printf %x $(( i )) )
 tc qd add dev $IFB parent 1:$slot $EST netem \
	limit 100000 $DELAY $REORDER $LOSS
done


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  0:09     ` Eric Dumazet
@ 2016-09-14  0:23       ` Michael Ma
  2016-09-14  1:18         ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-14  0:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-13 17:09 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:
>
>> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
>> 64. (is this the correct way of identifying the queue number on NIC?)
>> I setup ifb with 24 queues which is equal to the TX queue number of
>> eth0 and also the number of CPU cores.
>
> Please do not drop netdev@ from this mail exchange.

Sorry that I accidentally dropped that.
>
> ethtool -l eth0
>
>>
>> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
>> > a dedicate IFB queue and tasklet.
>> >
>> How is this achieved? I thought qdisc on ifb will still be protected
>> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
>> processing qdisc are still serialized without using MQ?
>
> You have to properly setup ifb/mq like in :
>
> # netem based setup, installed at receiver side only
> ETH=eth0
> IFB=ifb10
> #DELAY="delay 100ms"
> EST="est 1sec 4sec"
> #REORDER=1000us
> #LOSS="loss 2.0"
> TXQ=24  # change this to number of TX queues on the physical NIC
>
> ip link add $IFB numtxqueues $TXQ type ifb
> ip link set dev $IFB up
>
> tc qdisc del dev $ETH ingress 2>/dev/null
> tc qdisc add dev $ETH ingress 2>/dev/null
>
> tc filter add dev $ETH parent ffff: \
>    protocol ip u32 match u32 0 0 flowid 1:1 \
>         action mirred egress redirect dev $IFB
>
> tc qdisc del dev $IFB root 2>/dev/null
>
> tc qdisc add dev $IFB root handle 1: mq
> for i in `seq 1 $TXQ`
> do
>  slot=$( printf %x $(( i )) )
>  tc qd add dev $IFB parent 1:$slot $EST netem \
>         limit 100000 $DELAY $REORDER $LOSS
> done
>
>
If I understand correctly this still associates a qdisc with each
ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
to divide the bandwidth of each HTB class by the number of TX queues
across the individual HTB qdiscs?

My original idea was to attach an HTB qdisc to each ifb queue, each
representing a set of flows that don't share bandwidth with others, so
that root lock contention still happens but only affects flows in the
same HTB. Did I misunderstand the root lock contention issue for ifb?
I do see some comments in __dev_queue_xmit() about using a different
code path for software devices which bypasses __dev_xmit_skb(). Does
this mean ifb won't go through __dev_xmit_skb()?


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  0:23       ` Michael Ma
@ 2016-09-14  1:18         ` Eric Dumazet
  2016-09-14  5:13           ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-14  1:18 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev

On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:

> If I understand correctly this is still to associate a qdisc with each
> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
> to divide the bandwidth of each class in HTB by the number of TX
> queues for each individual HTB qdisc associated?
> 
> My original idea was to attach a HTB qdisc for each ifb queue
> representing a set of flows not sharing bandwidth with others so that
> root lock contention still happens but only affects flows in the same
> HTB. Did I understand the root lock contention issue incorrectly for
> ifb? I do see some comments in __dev_queue_xmit() about using a
> different code path for software devices which bypasses
> __dev_xmit_skb(). Does this mean ifb won't go through
> __dev_xmit_skb()?

You can install HTB on all of your MQ children for sure.

Again, there is no qdisc lock contention if you properly use MQ.

Now if you _need_ to install a single qdisc for whatever reason, then
maybe you want to use a single rx queue on the NIC, to reduce lock
contention ;)
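
As a minimal sketch (handles, rates and the class layout are
assumptions, not a recommendation), the netem loop from the earlier
script could be replaced with one HTB per MQ child:

IFB=ifb10   # as in the earlier script
TXQ=24
for i in `seq 1 $TXQ`
do
 slot=$( printf %x $(( i )) )
 hdl=$( printf %x $(( i + 16 )) )   # child handles 11:..28: avoid clashing with the mq root 1:
 tc qdisc add dev $IFB parent 1:$slot handle $hdl: htb default 10
 tc class add dev $IFB parent $hdl: classid $hdl:1 htb rate 1gbit ceil 1gbit
 tc class add dev $IFB parent $hdl:1 classid $hdl:10 htb rate 100mbit ceil 500mbit
done

Each TXQ then carries its own HTB tree (and its own root lock), at the
cost of splitting any shared rate across the queues, as discussed
above.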


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  1:18         ` Eric Dumazet
@ 2016-09-14  5:13           ` Michael Ma
  2016-09-14  5:19             ` Michael Ma
  2016-09-14  5:22             ` Eric Dumazet
  0 siblings, 2 replies; 14+ messages in thread
From: Michael Ma @ 2016-09-14  5:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>
>> If I understand correctly this is still to associate a qdisc with each
>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>> to divide the bandwidth of each class in HTB by the number of TX
>> queues for each individual HTB qdisc associated?
>>
>> My original idea was to attach a HTB qdisc for each ifb queue
>> representing a set of flows not sharing bandwidth with others so that
>> root lock contention still happens but only affects flows in the same
>> HTB. Did I understand the root lock contention issue incorrectly for
>> ifb? I do see some comments in __dev_queue_xmit() about using a
>> different code path for software devices which bypasses
>> __dev_xmit_skb(). Does this mean ifb won't go through
>> __dev_xmit_skb()?
>
> You can install HTB on all of your MQ children for sure.
>
> Again, there is no qdisc lock contention if you properly use MQ.
>
> Now if you _need_ to install a single qdisc for whatever reason, then
> maybe you want to use a single rx queue on the NIC, to reduce lock
> contention ;)
>
>
I don't intend to install multiple qdiscs - the only reason I'm doing
this now is to leverage MQ to work around the lock contention, and
based on the profile this all worked. However, to simplify the HTB
setup I wanted to use the TXQ to partition HTB classes so that an HTB
class only belongs to one TXQ, which also requires mapping skbs to
TXQs using some rule (here I'm using priority, but I assume it's
straightforward to use other information such as classid). The problem
I found is that when using priority to infer the TXQ, so that
queue_mapping is changed, bandwidth is affected significantly - the
only thing I can guess is that due to the queue switch there are more
cache misses, assuming processor cores have a static mapping to all
the queues. Any suggestion on what to do next for the investigation?

I would also guess that this should be a common problem if anyone
wants to use MQ+IFB to work around the qdisc lock contention on the
receiver side while a classful qdisc is used on IFB, but I haven't
really found a similar thread here...
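
As a first step, comparing the two setups (with and without the
queue_mapping rewrite) with standard tooling might narrow this down -
for example (device name is a placeholder):

perf stat -a -e cycles,instructions,cache-references,cache-misses sleep 10
tc -s qdisc show dev ifb10   # per-queue packet/byte counts, to confirm which TXQs are used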


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  5:13           ` Michael Ma
@ 2016-09-14  5:19             ` Michael Ma
  2016-09-14  5:22             ` Eric Dumazet
  1 sibling, 0 replies; 14+ messages in thread
From: Michael Ma @ 2016-09-14  5:19 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang; +Cc: netdev

2016-09-13 22:13 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>>
>>> If I understand correctly this is still to associate a qdisc with each
>>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>>> to divide the bandwidth of each class in HTB by the number of TX
>>> queues for each individual HTB qdisc associated?
>>>
>>> My original idea was to attach a HTB qdisc for each ifb queue
>>> representing a set of flows not sharing bandwidth with others so that
>>> root lock contention still happens but only affects flows in the same
>>> HTB. Did I understand the root lock contention issue incorrectly for
>>> ifb? I do see some comments in __dev_queue_xmit() about using a
>>> different code path for software devices which bypasses
>>> __dev_xmit_skb(). Does this mean ifb won't go through
>>> __dev_xmit_skb()?
>>
>> You can install HTB on all of your MQ children for sure.
>>
>> Again, there is no qdisc lock contention if you properly use MQ.
>>
>> Now if you _need_ to install a single qdisc for whatever reason, then
>> maybe you want to use a single rx queue on the NIC, to reduce lock
>> contention ;)

Yes - this might reduce lock contention, but there would still be
contention, and I'm really looking for more concurrency...

>>
>>
> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
>
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

Hi Cong - I saw quite a few threads from you regarding ingress qdisc
+ MQ and queue_mapping issues. Do you by any chance have a similar
setup? (i.e. classful qdiscs associated with the queues of IFB, which
requires modifying queue_mapping so that the qdisc selection is done
at queue selection time based on information such as skb
priority/classid.) Would appreciate any suggestions.


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  5:13           ` Michael Ma
  2016-09-14  5:19             ` Michael Ma
@ 2016-09-14  5:22             ` Eric Dumazet
  2016-09-14 17:46               ` Michael Ma
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-14  5:22 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev

On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:

> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
> 
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

But why are you changing the queue ?

NIC already does the proper RSS thing, meaning all packets of one flow
should land on one RX queue. No need to 'classify yourself and risk
lock contention'.

I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
problem.

Do you really need to rate limit flows ? It's not clear what your
goals are - why, for example, do you use HTB to begin with?


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14  5:22             ` Eric Dumazet
@ 2016-09-14 17:46               ` Michael Ma
  2016-09-16  0:51                 ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-14 17:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>
>> I don't intend to install multiple qdisc - the only reason that I'm
>> doing this now is to leverage MQ to workaround the lock contention,
>> and based on the profile this all worked. However to simplify the way
>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>> HTB class only belongs to one TXQ, which also requires mapping skb to
>> TXQ using some rules (here I'm using priority but I assume it's
>> straightforward to use other information such as classid). And the
>> problem I found here is that when using priority to infer the TXQ so
>> that queue_mapping is changed, bandwidth is affected significantly -
>> the only thing I can guess is that due to queue switch, there are more
>> cache misses assuming processor cores have a static mapping to all the
>> queues. Any suggestion on what to do next for the investigation?
>>
>> I would also guess that this should be a common problem if anyone
>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>> receiver side and classful qdisc is used on IFB, but haven't really
>> found a similar thread here...
>
> But why are you changing the queue ?
>
> NIC already does the proper RSS thing, meaning all packets of one flow
> should land on one RX queue. No need to ' classify yourself and risk
> lock contention'
>
> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
> problem.
>
> Do you really need to rate limit flows ? Not clear what are your goals,
> why for example you use HTB to begin with.
>
Yes. My goal is to set different min/max bandwidth limits for
different processes, so we started with HTB. However, with HTB the
qdisc root lock contention caused some unintended correlation between
flows in different classes. For example, if some flows belonging to
one class have a large number of small packets, other flows in a
different class will get their effective bandwidth reduced because
they'll wait longer for the root lock. Using MQ this can be avoided
because I'll just put the flows belonging to one class into its
dedicated TXQ. Classes within one HTB on a TXQ will still have the
lock contention problem, but classes in different HTBs will use
different root locks, so that contention doesn't exist.

This also means that I'll need to classify packets into different
TXQs/HTBs based on some skb metadata (essentially similar to what
mqprio is doing). So the TXQ might need to be switched to achieve
this.
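
To make the min/max semantics concrete, each per-TXQ HTB could carry
classes along these lines (the rates, classids and the port-based
filter are placeholders for whatever classification is actually used;
rate is the guaranteed minimum and ceil the maximum):

# inside one ifb TXQ's HTB (handle 11: as in the sketch earlier in the thread)
tc class add dev ifb10 parent 11:1 classid 11:20 htb rate 200mbit ceil 800mbit
tc filter add dev ifb10 parent 11: protocol ip u32 \
   match ip dport 5001 0xffff flowid 11:20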


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-14 17:46               ` Michael Ma
@ 2016-09-16  0:51                 ` Michael Ma
  2016-09-16 17:57                   ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-16  0:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-14 10:46 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>
>>> I don't intend to install multiple qdisc - the only reason that I'm
>>> doing this now is to leverage MQ to workaround the lock contention,
>>> and based on the profile this all worked. However to simplify the way
>>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>>> HTB class only belongs to one TXQ, which also requires mapping skb to
>>> TXQ using some rules (here I'm using priority but I assume it's
>>> straightforward to use other information such as classid). And the
>>> problem I found here is that when using priority to infer the TXQ so
>>> that queue_mapping is changed, bandwidth is affected significantly -
>>> the only thing I can guess is that due to queue switch, there are more
>>> cache misses assuming processor cores have a static mapping to all the
>>> queues. Any suggestion on what to do next for the investigation?
>>>
>>> I would also guess that this should be a common problem if anyone
>>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>>> receiver side and classful qdisc is used on IFB, but haven't really
>>> found a similar thread here...
>>
>> But why are you changing the queue ?
>>
>> NIC already does the proper RSS thing, meaning all packets of one flow
>> should land on one RX queue. No need to ' classify yourself and risk
>> lock contention'
>>
>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>> problem.
>>
>> Do you really need to rate limit flows ? Not clear what are your goals,
>> why for example you use HTB to begin with.
>>
> Yes. My goal is to set different min/max bandwidth limits for
> different processes, so we started with HTB. However with HTB the
> qdisc root lock contention caused some unintended correlation between
> flows in different classes. For example if some flows belonging to one
> class have large amount of small packets, other flows in a different
> class will get their effective bandwidth reduced because they'll wait
> longer for the root lock. Using MQ this can be avoided because I'll
> just put flows belonging to one class to its dedicated TXQ. Then
> classes within one HTB on a TXQ will still have the lock contention
> problem but classes in different HTB will use different root locks so
> the contention doesn't exist.
>
> This also means that I'll need to classify packets to different
> TXQ/HTB based on some skb metadata (essentially similar to what mqprio
> is doing). So TXQ might need to be switched to achieve this.

My current theory about this problem is that tasklets in IFB might be
scheduled on the same cpu core if the RXQ happens to be the same for
two different flows. When queue_mapping is modified and multiple flows
are concentrated onto the same IFB TXQ because they need to be
controlled by the same HTB, they'll have to use the same tasklet
because of the way IFB is implemented. So if other flows belonging to
a different TXQ/tasklet happen to be scheduled on the same core, that
core can be overloaded and become the bottleneck. Without modifying
the queue_mapping the chance of this contention is much lower.

This is speculation based on the increased si time in the ksoftirqd
process. I'll try to affinitize each tasklet with a cpu core to verify
whether this is the problem. I also noticed that in the past there was
a similar proposal of scheduling the tasklet on a dedicated core which
was not committed (https://patchwork.ozlabs.org/patch/38486/). I'll
try something similar to verify this theory.
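
Before changing any code, the softirq distribution itself can be
watched with something like (a sketch - column names vary by tool
version):

mpstat -P ALL 1                     # %soft per CPU; look for one core pinned near 100%
watch -d -n1 'cat /proc/softirqs'   # per-CPU NET_RX/TASKLET counters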


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-16  0:51                 ` Michael Ma
@ 2016-09-16 17:57                   ` Michael Ma
  2016-09-16 19:53                     ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-16 17:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-15 17:51 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-09-14 10:46 GMT-07:00 Michael Ma <make0818@gmail.com>:
>> 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>>
>>>> I don't intend to install multiple qdisc - the only reason that I'm
>>>> doing this now is to leverage MQ to workaround the lock contention,
>>>> and based on the profile this all worked. However to simplify the way
>>>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>>>> HTB class only belongs to one TXQ, which also requires mapping skb to
>>>> TXQ using some rules (here I'm using priority but I assume it's
>>>> straightforward to use other information such as classid). And the
>>>> problem I found here is that when using priority to infer the TXQ so
>>>> that queue_mapping is changed, bandwidth is affected significantly -
>>>> the only thing I can guess is that due to queue switch, there are more
>>>> cache misses assuming processor cores have a static mapping to all the
>>>> queues. Any suggestion on what to do next for the investigation?
>>>>
>>>> I would also guess that this should be a common problem if anyone
>>>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>>>> receiver side and classful qdisc is used on IFB, but haven't really
>>>> found a similar thread here...
>>>
>>> But why are you changing the queue ?
>>>
>>> NIC already does the proper RSS thing, meaning all packets of one flow
>>> should land on one RX queue. No need to ' classify yourself and risk
>>> lock contention'
>>>
>>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>>> problem.
>>>
>>> Do you really need to rate limit flows ? Not clear what are your goals,
>>> why for example you use HTB to begin with.
>>>
>> Yes. My goal is to set different min/max bandwidth limits for
>> different processes, so we started with HTB. However with HTB the
>> qdisc root lock contention caused some unintended correlation between
>> flows in different classes. For example if some flows belonging to one
>> class have large amount of small packets, other flows in a different
>> class will get their effective bandwidth reduced because they'll wait
>> longer for the root lock. Using MQ this can be avoided because I'll
>> just put flows belonging to one class to its dedicated TXQ. Then
>> classes within one HTB on a TXQ will still have the lock contention
>> problem but classes in different HTB will use different root locks so
>> the contention doesn't exist.
>>
>> This also means that I'll need to classify packets to different
>> TXQ/HTB based on some skb metadata (essentially similar to what mqprio
>> is doing). So TXQ might need to be switched to achieve this.
>
> My current theory to this problem is that tasklets in IFB might be
> scheduled to the same cpu core if the RXQ happens to be the same for
> two different flows. When queue_mapping is modified and multiple flows
> are concentrated to the same IFB TXQ because they need to be
> controlled by the same HTB, they'll have to use the same tasklet
> because of the way IFB is implemented. So if other flows belonging to
> a different TXQ/tasklet happens to be scheduled on the same core, that
> core can be overloaded and becomes the bottleneck. Without modifying
> the queue_mapping the chance of this contention is much lower.
>
> This is a speculation based on the increased si time in softirqd
> process. I'll try to affinitize each tasklet with a cpu core to verify
> whether this is the problem. I also noticed that in the past there was
> a similar proposal of scheduling the tasklet to a dedicated core which
> was not committed(https://patchwork.ozlabs.org/patch/38486/). I'll try
> something similar to verify this theory.

This is actually the problem - if flows from different RX queues are
switched to the same TX queue in IFB, they'll use different processor
contexts with the same tasklet, and the processor contexts of different
tasklets might be the same. So multiple tasklets in IFB compete for
the same core when the queue is switched.

The following simple fix proved this - with this change, even switching
the queue won't affect small-packet bandwidth/latency anymore:

in ifb.c:

-       struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
+       struct ifb_q_private *txp = dp->tx_private +
+               (smp_processor_id() % dev->num_tx_queues);

This should be more efficient since we're not sending the work to a
different processor; instead we try to queue the packet to an
appropriate tasklet based on the processor ID. Will this cause any
packet out-of-order problems? If packets from the same flow are queued
to the same RX queue due to RSS, and processor affinity is set for the
RX queues, I assume packets from the same flow will end up on the same
core when the tasklet is scheduled. But I might have missed some
uncommon cases here... I would appreciate it if anyone can provide
more insight.


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-16 17:57                   ` Michael Ma
@ 2016-09-16 19:53                     ` Eric Dumazet
  2016-09-16 22:00                       ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2016-09-16 19:53 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev

On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:

> This is actually the problem - if flows from different RX queues are
> switched to the same RX queue in IFB, they'll use different processor
> context with the same tasklet, and the processor context of different
> tasklets might be the same. So multiple tasklets in IFB competes for
> the same core when queue is switched.
> 
> The following simple fix proved this - with this change even switching
> the queue won't affect small packet bandwidth/latency anymore:
> 
> in ifb.c:
> 
> -       struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
> +       struct ifb_q_private *txp = dp->tx_private +
> (smp_processor_id() % dev->num_tx_queues);
> 
> This should be more efficient since we're not sending the task to a
> different processor, instead we try to queue the packet to an
> appropriate tasklet based on the processor ID. Will this cause any
> packet out-of-order problem? If packets from the same flow are queued
> to the same RX queue due to RSS, and processor affinity is set for RX
> queues, I assume packets from the same flow will end up in the same
> core when tasklet is scheduled. But I might have missed some uncommon
> cases here... Would appreciate if anyone can provide more insights.

Wait, don't you have proper smp affinity for the RX queues on your NIC ?

( Documentation/networking/scaling.txt RSS IRQ Configuration )
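
For reference, that pinning typically looks something like this (IRQ
numbers and masks are examples only):

grep eth0 /proc/interrupts             # find the per-RX-queue IRQ numbers
echo 1 > /proc/irq/123/smp_affinity    # eth0 rx-0 -> CPU0
echo 2 > /proc/irq/124/smp_affinity    # eth0 rx-1 -> CPU1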

A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
the driver queue is locked before ndo_start_xmit() is called (for non
NETIF_F_LLTX drivers at least).

In the case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the
skb list.

In any case, you could have an action to do this before reaching IFB.


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-16 19:53                     ` Eric Dumazet
@ 2016-09-16 22:00                       ` Michael Ma
  2016-09-23 23:21                         ` Michael Ma
  0 siblings, 1 reply; 14+ messages in thread
From: Michael Ma @ 2016-09-16 22:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:
>
>> This is actually the problem - if flows from different RX queues are
>> switched to the same RX queue in IFB, they'll use different processor
>> context with the same tasklet, and the processor context of different
>> tasklets might be the same. So multiple tasklets in IFB competes for
>> the same core when queue is switched.
>>
>> The following simple fix proved this - with this change even switching
>> the queue won't affect small packet bandwidth/latency anymore:
>>
>> in ifb.c:
>>
>> -       struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
>> +       struct ifb_q_private *txp = dp->tx_private +
>> (smp_processor_id() % dev->num_tx_queues);
>>
>> This should be more efficient since we're not sending the task to a
>> different processor, instead we try to queue the packet to an
>> appropriate tasklet based on the processor ID. Will this cause any
>> packet out-of-order problem? If packets from the same flow are queued
>> to the same RX queue due to RSS, and processor affinity is set for RX
>> queues, I assume packets from the same flow will end up in the same
>> core when tasklet is scheduled. But I might have missed some uncommon
>> cases here... Would appreciate if anyone can provide more insights.
>
> Wait, don't you have proper smp affinity for the RX queues on your NIC ?
>
> ( Documentation/networking/scaling.txt RSS IRQ Configuration )
>
Yes - what I was trying to say is that this change will be more
efficient than using smp_call_function_single() to schedule the
tasklet to a different processor.

RSS IRQ should be set properly already. The issue here is that I'll
need to switch the queue mapping for NIC RX to a different TXQ on IFB,
which allows me to classify the flows at the IFB TXQ layer and avoid
qdisc lock contention.

When that switch happens, ideally the processor core shouldn't change
because none of the thread context has changed. The work in the
tasklet should be scheduled on the same processor as well. That's why
I tried this change. Also, conceptually IFB is a software device which
should be able to schedule its workload independently of how the NIC
is configured for interrupt handling.

> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
> the driver queue is locked before ndo_start_xmit())  (for non
> NETIF_F_LLTX drivers at least)
>

Thanks a lot for pointing this out! I was expecting this kind of
guidance... Then the options would be:

1. Use smp_call_function_single() to schedule the tasklet to a core
statically mapped to the IFB TXQ, which is very similar to how TX/RX
IRQ is configured.
2. As you suggested below, add some additional action to do the
rescheduling before entering IFB - for example, when receiving the
packet we could just use RSS to redirect it to the desired RXQ.
However this doesn't seem easy, especially compared with the way
mqprio chooses the queue. The challenge here is that IFB queue
selection is based on queue_mapping when the skb arrives at IFB, while
core selection is based on the RXQ on the NIC, so it's also based on
queue_mapping when the skb arrives at the NIC. These two queue_mappings
must then be the same so that there is no core conflict in processing
two TXQs of IFB, which essentially means we would have to change the
queue mapping on the NIC on the receiver side - and that can't be
achieved using TC.

> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
> list.
>
> In any case, you could have an action to do this before reaching IFB.
>
>
>


* Re: Modification to skb->queue_mapping affecting performance
  2016-09-16 22:00                       ` Michael Ma
@ 2016-09-23 23:21                         ` Michael Ma
  0 siblings, 0 replies; 14+ messages in thread
From: Michael Ma @ 2016-09-23 23:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

2016-09-16 15:00 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:
>>
>>> This is actually the problem - if flows from different RX queues are
>>> switched to the same RX queue in IFB, they'll use different processor
>>> context with the same tasklet, and the processor context of different
>>> tasklets might be the same. So multiple tasklets in IFB competes for
>>> the same core when queue is switched.
>>>
>>> The following simple fix proved this - with this change even switching
>>> the queue won't affect small packet bandwidth/latency anymore:
>>>
>>> in ifb.c:
>>>
>>> -       struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
>>> +       struct ifb_q_private *txp = dp->tx_private +
>>> (smp_processor_id() % dev->num_tx_queues);
>>>
>>> This should be more efficient since we're not sending the task to a
>>> different processor, instead we try to queue the packet to an
>>> appropriate tasklet based on the processor ID. Will this cause any
>>> packet out-of-order problem? If packets from the same flow are queued
>>> to the same RX queue due to RSS, and processor affinity is set for RX
>>> queues, I assume packets from the same flow will end up in the same
>>> core when tasklet is scheduled. But I might have missed some uncommon
>>> cases here... Would appreciate if anyone can provide more insights.
>>
>> Wait, don't you have proper smp affinity for the RX queues on your NIC ?
>>
>> ( Documentation/networking/scaling.txt RSS IRQ Configuration )
>>
> Yes - what I was trying to say is that this change will be more
> efficient than using smp_call_function_single() to schedule the
> tasklet to a different processor.
>
> RSS IRQ should be set properly already. The issue here is that I'll
> need to switch the queue mapping for NIC RX to a different TXQ on IFB,
> which allows me to classify the flows at the IFB TXQ layer and avoid
> qdisc lock contention.
>
> When that switch happens, ideally processor core shouldn't be switched
> because all the thread context isn't changed. The work in tasklet
> should be scheduled to the same processor as well. That's why I tried
> this change. Also conceptually IFB is a software device which should
> be able to schedule its workload independent from how NIC is
> configured for the interrupt handling.
>
>> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
>> the driver queue is locked before ndo_start_xmit())  (for non
>> NETIF_F_LLTX drivers at least)
>>
>
> Thanks a lot for pointing out this! I was expecting this kind of
> guidance... Then the options would be:
>
> 1. Use smp_call_function_single() to schedule the tasklet to a core
> statically mapped to the IFB TXQ, which is very similar to how TX/RX
> IRQ is configured.

This actually won't help with throughput, because ultimately the load
will still be concentrated on some particular cores after packets are
concentrated onto a TXQ by the queue-level classification.

> 2. As you suggested below add some additional action to do the
> rescheduling before entering IFB - for example when receiving the
> packet we could just use RSS to redirect to the desired RXQ, however
> this doesn't seem to be easy, especially compared with the way how
> mqprio chooses the queue. The challenge here is that IFB queue
> selection is based on queue_mapping when skb arrives at IFB and core
> selection is based on RXQ on NIC and so it's also based on
> queue_mapping when skb arrives at NIC. Then these two queue_mappings
> must be the same so that there is no core conflict of processing two
> TXQs of IFB. Then this essentially means we have to change queue
> mapping of the NIC on the receiver side which can't be achieved using
> TC.
>

I tried to explore this further - there is actually XPS on ifb, which
can be used to specify the processor cores that will be used to
process each TXQ of ifb. However the problem is similar to the above -
eventually I'll have a few cores processing these queues instead of
having all the cores processing together with relatively light
contention, and this again reduces the throughput. So there isn't a
good place to do this. The ultimate problem is that we're trying to
work around the qdisc spinlock problem by leveraging the independence
of TXQs, but at the same time, after the qdisc phase, we also want to
maximize the utilization of cores across whatever TXQs are used.
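
For completeness, the XPS knob mentioned above is the per-TXQ cpumask
under sysfs (device name and masks below are examples only; XPS maps
CPUs to TX queues, so restricting a queue's mask also restricts which
cores end up processing it on ifb):

echo 00000f > /sys/class/net/ifb10/queues/tx-0/xps_cpus
echo 0000f0 > /sys/class/net/ifb10/queues/tx-1/xps_cpus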

>> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
>> list.
>>
>> In any case, you could have an action to do this before reaching IFB.
>>
>>
>>

So here is another solution - for packets coming from the NIC ingress
path the context is already softirq context, and there is no need to
start another tasklet based on the queue selected, right? All the RQ
handling and netif_tx_stop/wakeup stuff in the ifb module is
unnecessary in this case. Then we can just do the transmit/receive
work in ifb_xmit() directly to simplify the problem here - there is no
synchronization issue for the RQ anymore, since there is no RQ at all.

Eric, I would appreciate it if you could share your thoughts here.

