* qdisc spin lock
@ 2016-03-30  7:20 Michael Ma
  2016-03-31 19:18 ` Jesper Dangaard Brouer
  2016-03-31 22:16 ` Cong Wang
  0 siblings, 2 replies; 21+ messages in thread
From: Michael Ma @ 2016-03-30  7:20 UTC (permalink / raw)
  To: netdev

Hi -

I know this might be an old topic, so bear with me. What we are facing
is that applications are sending small packets using hundreds of
threads, so the contention on the spin lock in __dev_xmit_skb increases
the latency of dev_queue_xmit significantly. We're building a network
QoS solution to avoid interference between different applications using
HTB. But in this case, when some applications send massive amounts of
small packets in parallel, the application to be protected sees its
throughput suffer (because it does synchronous network communication
using multiple threads and its throughput is sensitive to the increased
latency).

Here is the profiling from perf:

-  67.57%  iperf  [kernel.kallsyms]  [k] _spin_lock
   - 99.94% dev_queue_xmit
        96.91% _spin_lock
      - 2.62% __qdisc_run
         - 98.98% sch_direct_xmit
              99.98% _spin_lock
           1.01% _spin_lock

As far as I understand, the design of TC is to keep the locking scheme
simple and to minimize the work done in __qdisc_run so that throughput
won't be affected, especially with large packets. However, if the
scenario is that the classes in the queueing discipline only have
shaping limits, there isn't really any necessary correlation between
different classes. The only synchronization point should be when a
packet is dequeued from the qdisc queue and enqueued onto the transmit
queue of the device. My question is: is it worth investing in avoiding
the lock contention by partitioning the queue/lock so that this
scenario is handled with lower latency?

I must have oversimplified a lot of details since I'm not familiar
with the TC implementation at this point; I just want to get your input
on whether this is a worthwhile effort or whether there is something
fundamental that I'm not aware of. If it is just a matter of quite a
bit of additional work, I would also appreciate help outlining the
required work.

I would also appreciate any information about the latest status of
this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

Thanks,
Ke Ma

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-30  7:20 qdisc spin lock Michael Ma
@ 2016-03-31 19:18 ` Jesper Dangaard Brouer
  2016-03-31 23:41   ` Michael Ma
  2016-03-31 22:16 ` Cong Wang
  1 sibling, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2016-03-31 19:18 UTC (permalink / raw)
  To: Michael Ma; +Cc: brouer, netdev


On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:

> I know this might be an old topic so bare with me – what we are facing
> is that applications are sending small packets using hundreds of
> threads so the contention on spin lock in __dev_xmit_skb increases the
> latency of dev_queue_xmit significantly. We’re building a network QoS
> solution to avoid interference of different applications using HTB.

Yes, as you have noticed with HTB there is a single qdisc lock, and
congestion obviously happens :-)

It is possible, with different tricks, to make it scale.  I believe
Google is using a variant of HTB, and it scales for them.  They have
not open sourced their modifications to HTB (which likely also involve
a great deal of setup tricks).

If your purpose is to limit traffic/bandwidth per "cloud" instance,
then you can just use another TC setup structure, like using MQ and
assigning an HTB per MQ queue (where the MQ queues are bound to each
CPU/HW queue)... But you have to figure out this setup yourself...
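
Roughly something like this (untested, just to illustrate the
structure; handles and rates are made up):

  tc qdisc add dev eth0 root handle 100: mq
  # each TX queue gets its own independent HTB instance
  tc qdisc add dev eth0 parent 100:1 handle 101: htb default 10
  tc class add dev eth0 parent 101: classid 101:10 htb rate 1gbit
  tc qdisc add dev eth0 parent 100:2 handle 102: htb default 10
  tc class add dev eth0 parent 102: classid 102:10 htb rate 1gbit
  # ...and so on for the remaining TX queues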


> But in this case when some applications send massive small packets in
> parallel, the application to be protected will get its throughput
> affected (because it’s doing synchronous network communication using
> multiple threads and throughput is sensitive to the increased latency)
> 
> Here is the profiling from perf:
> 
> -  67.57%   iperf  [kernel.kallsyms]     [k] _spin_lock
>   - 99.94% dev_queue_xmit
>   -  96.91% _spin_lock                                                           
>  - 2.62%  __qdisc_run                                                                
>   - 98.98% sch_direct_xmit
> - 99.98% _spin_lock
> 
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won’t be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn’t really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing on avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

Yes, there is a lot to gain, but it is not easy ;-)

> I must have oversimplified a lot of details since I’m not familiar
> with the TC implementation at this point – just want to get your input
> in terms of whether this is a worthwhile effort or there is something
> fundamental that I’m not aware of. If this is just a matter of quite
> some additional work, would also appreciate helping to outline the
> required work here.
> 
> Also would appreciate if there is any information about the latest
> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

This article seems to be of very low quality... spelling errors, only 5
pages, no real code, etc.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-30  7:20 qdisc spin lock Michael Ma
  2016-03-31 19:18 ` Jesper Dangaard Brouer
@ 2016-03-31 22:16 ` Cong Wang
  2016-03-31 23:48   ` Michael Ma
  1 sibling, 1 reply; 21+ messages in thread
From: Cong Wang @ 2016-03-31 22:16 UTC (permalink / raw)
  To: Michael Ma; +Cc: Linux Kernel Network Developers

On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won’t be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn’t really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing on avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

If your HTB classes don't share bandwidth, why do you still put them
under the same hierarchy? IOW, you can just isolate them, either with
some other qdisc or with separate interfaces.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 19:18 ` Jesper Dangaard Brouer
@ 2016-03-31 23:41   ` Michael Ma
  2016-04-16  8:52     ` Andrew
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-03-31 23:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev

Thanks for the suggestion - I'll try the MQ solution out. It seems to
solve the problem well, under the assumption that bandwidth can be
statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:
>
>> I know this might be an old topic so bare with me – what we are facing
>> is that applications are sending small packets using hundreds of
>> threads so the contention on spin lock in __dev_xmit_skb increases the
>> latency of dev_queue_xmit significantly. We’re building a network QoS
>> solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and
> congestion obviously happens :-)
>
> It is possible with different tricks to make it scale.  I believe
> Google is using a variant of HTB, and it scales for them.  They have
> not open source their modifications to HTB (which likely also involves
> a great deal of setup tricks).
>
> If your purpose it to limit traffic/bandwidth per "cloud" instance,
> then you can just use another TC setup structure.  Like using MQ and
> assigning a HTB per MQ queue (where the MQ queues are bound to each
> CPU/HW queue)... But you have to figure out this setup yourself...
>
>
>> But in this case when some applications send massive small packets in
>> parallel, the application to be protected will get its throughput
>> affected (because it’s doing synchronous network communication using
>> multiple threads and throughput is sensitive to the increased latency)
>>
>> Here is the profiling from perf:
>>
>> -  67.57%   iperf  [kernel.kallsyms]     [k] _spin_lock
>>   - 99.94% dev_queue_xmit
>>   -  96.91% _spin_lock
>>  - 2.62%  __qdisc_run
>>   - 98.98% sch_direct_xmit
>> - 99.98% _spin_lock
>>
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> Yes, there is a lot go gain, but it is not easy ;-)
>
>> I must have oversimplified a lot of details since I’m not familiar
>> with the TC implementation at this point – just want to get your input
>> in terms of whether this is a worthwhile effort or there is something
>> fundamental that I’m not aware of. If this is just a matter of quite
>> some additional work, would also appreciate helping to outline the
>> required work here.
>>
>> Also would appreciate if there is any information about the latest
>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5
> pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 22:16 ` Cong Wang
@ 2016-03-31 23:48   ` Michael Ma
  2016-04-01  2:19     ` David Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Michael Ma @ 2016-03-31 23:48 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

I didn't really know that multiple qdiscs can be isolated using MQ so
that each txq can be associated with a particular qdisc. Also we don't
really have multiple interfaces...

With this MQ solution we'll still need to assign transmit queues to
different classes by doing some math on the bandwidth limit if I
understand correctly, which seems to be less convenient compared with
a solution purely within HTB.

I assume that with this solution I can still share qdisc among
multiple transmit queues - please let me know if this is not the case.

2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> If your HTB classes don't share bandwidth, why do you still make them
> under the same hierarchy? IOW, you can just isolate them either with some
> other qdisc or just separated interfaces.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 23:48   ` Michael Ma
@ 2016-04-01  2:19     ` David Miller
  2016-04-01 17:17       ` Michael Ma
  2016-04-01  3:44     ` John Fastabend
  2016-04-08 14:19     ` Eric Dumazet
  2 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2016-04-01  2:19 UTC (permalink / raw)
  To: make0818; +Cc: xiyou.wangcong, netdev

From: Michael Ma <make0818@gmail.com>
Date: Thu, 31 Mar 2016 16:48:43 -0700

> I didn't really know that multiple qdiscs can be isolated using MQ so
 ...

Please stop top-posting.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 23:48   ` Michael Ma
  2016-04-01  2:19     ` David Miller
@ 2016-04-01  3:44     ` John Fastabend
  2016-04-13 18:23       ` Michael Ma
  2016-04-08 14:19     ` Eric Dumazet
  2 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2016-04-01  3:44 UTC (permalink / raw)
  To: Michael Ma, Cong Wang; +Cc: Linux Kernel Network Developers

On 16-03-31 04:48 PM, Michael Ma wrote:
> I didn't really know that multiple qdiscs can be isolated using MQ so
> that each txq can be associated with a particular qdisc. Also we don't
> really have multiple interfaces...

MQ will assign a default qdisc to each txq and the default qdisc can
be changed to htb or any other qdisc of your choice.

> 
> With this MQ solution we'll still need to assign transmit queues to
> different classes by doing some math on the bandwidth limit if I
> understand correctly, which seems to be less convenient compared with
> a solution purely within HTB.
> 

Agreed.

> I assume that with this solution I can still share qdisc among
> multiple transmit queues - please let me know if this is not the case.

Nope, sorry, it doesn't work that way unless you employ some sort of
stacked netdevice strategy, which does start to get a bit complex. The
basic hint would be to stack some type of virtual netdev on top of the
device and run the HTB qdisc there, then push traffic onto that netdev
depending on the class it belongs to. It's ugly, yes.

Noting all that, I posted an RFC patch some time back to allow writing
qdiscs that do not require taking the lock. I'll try to respin these
and submit them when net-next opens again. The next logical step is to
write a "better" HTB, probably using a shared counter and dropping the
requirement that it be exact.

Sorry, I didn't get a chance to look at the paper in your post, so I'm
not sure whether they suggest something similar.

Thanks,
John

> 
> 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
>>> As far as I understand the design of TC is to simplify locking schema
>>> and minimize the work in __qdisc_run so that throughput won’t be
>>> affected, especially with large packets. However if the scenario is
>>> that multiple classes in the queueing discipline only have the shaping
>>> limit, there isn’t really a necessary correlation between different
>>> classes. The only synchronization point should be when the packet is
>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>> the device. My question is – is it worth investing on avoiding the
>>> locking contention by partitioning the queue/lock so that this
>>> scenario is addressed with relatively smaller latency?
>>
>> If your HTB classes don't share bandwidth, why do you still make them
>> under the same hierarchy? IOW, you can just isolate them either with some
>> other qdisc or just separated interfaces.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-01  2:19     ` David Miller
@ 2016-04-01 17:17       ` Michael Ma
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ma @ 2016-04-01 17:17 UTC (permalink / raw)
  To: David Miller; +Cc: Cong Wang, Linux Kernel Network Developers

2016-03-31 19:19 GMT-07:00 David Miller <davem@davemloft.net>:
> From: Michael Ma <make0818@gmail.com>
> Date: Thu, 31 Mar 2016 16:48:43 -0700
>
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>  ...
>
> Please stop top-posting.

Sorry that I wasn't aware of this...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 23:48   ` Michael Ma
  2016-04-01  2:19     ` David Miller
  2016-04-01  3:44     ` John Fastabend
@ 2016-04-08 14:19     ` Eric Dumazet
  2016-04-15 22:46       ` Michael Ma
  2016-04-20 21:24       ` Michael Ma
  2 siblings, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2016-04-08 14:19 UTC (permalink / raw)
  To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers

On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
> I didn't really know that multiple qdiscs can be isolated using MQ so
> that each txq can be associated with a particular qdisc. Also we don't
> really have multiple interfaces...
> 
> With this MQ solution we'll still need to assign transmit queues to
> different classes by doing some math on the bandwidth limit if I
> understand correctly, which seems to be less convenient compared with
> a solution purely within HTB.
> 
> I assume that with this solution I can still share qdisc among
> multiple transmit queues - please let me know if this is not the case.

Note that this MQ + HTB thing works well, unless you use a bonding
device. (Or you need the MQ+HTB on the slaves, with no way of sharing
tokens between the slaves)


https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3

bonding can not really be used as a true MQ device yet.

I might send a patch to disable this 'bonding feature' if no slave sets
a queue_id. 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-01  3:44     ` John Fastabend
@ 2016-04-13 18:23       ` Michael Ma
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ma @ 2016-04-13 18:23 UTC (permalink / raw)
  To: John Fastabend; +Cc: Cong Wang, Linux Kernel Network Developers

2016-03-31 20:44 GMT-07:00 John Fastabend <john.fastabend@gmail.com>:
> On 16-03-31 04:48 PM, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>
> MQ will assign a default qdisc to each txq and the default qdisc can
> be changed to htb or any other qdisc of your choice.
>
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>
> Agreed.
>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Nope sorry doesn't work that way unless you employ some sort of stacked
> netdevice strategy which does start to get a bit complex. The basic hint
> would be to stack some type of virtual netdev on top of a device and
> run the htb qdisc there. Push traffic onto the netdev depending on the
> class it belongs to. Its ugly yes.

Thanks for the input - is it possible to implement an MQ that allows
one qdisc to be associated with multiple TXQs belonging to one inner
class? Does that require any additional synchronization in the qdisc
routines? I think it shouldn't, because the common use of TC already
associates one qdisc with all the TXQs. This would simplify the way the
qdisc hierarchy is set up, since a qdisc wouldn't need to be created
for each queue.

Also, is it possible to allow multiple qdiscs to be associated with one
TXQ? I can see one issue with this - the TC interface for MQ today
doesn't allow one TXQ to be associated with multiple classes, but this
seems to be a restriction only from the tc command interface
perspective. MQ anyway only establishes a connection between a TXQ and
a qdisc, so theoretically we could associate one TXQ with multiple
classes/qdiscs and use the priority/classid of the packet to do the
pre-selection of the qdisc. Again, I might have missed some potential
synchronization issue that would arise if this actually happened,
because today a TXQ is not shared by multiple qdiscs in any case.

>
> Noting all that I posted an RFC patch some time back to allow writing
> qdiscs that do not require taking the lock. I'll try to respin these
> and submit them when net-next opens again. The next logical step is to
> write a "better" HTB probably using a shared counter and dropping the
> requirement that it be exact.
>
> Sorry I didn't get a chance to look at the paper in your post so not
> sure if they suggest something similar or not.
>
> Thanks,
> John
>
>>
>> 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>:
>>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
>>>> As far as I understand the design of TC is to simplify locking schema
>>>> and minimize the work in __qdisc_run so that throughput won’t be
>>>> affected, especially with large packets. However if the scenario is
>>>> that multiple classes in the queueing discipline only have the shaping
>>>> limit, there isn’t really a necessary correlation between different
>>>> classes. The only synchronization point should be when the packet is
>>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>>> the device. My question is – is it worth investing on avoiding the
>>>> locking contention by partitioning the queue/lock so that this
>>>> scenario is addressed with relatively smaller latency?
>>>
>>> If your HTB classes don't share bandwidth, why do you still make them
>>> under the same hierarchy? IOW, you can just isolate them either with some
>>> other qdisc or just separated interfaces.
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-08 14:19     ` Eric Dumazet
@ 2016-04-15 22:46       ` Michael Ma
  2016-04-15 22:54         ` Eric Dumazet
  2016-04-20 21:24       ` Michael Ma
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-15 22:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)
>
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>
>
So there is no way of using this MQ solution when a bonding interface
is used, right? Then modifying HTB might be the only solution?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-15 22:46       ` Michael Ma
@ 2016-04-15 22:54         ` Eric Dumazet
  2016-04-15 23:05           ` Michael Ma
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2016-04-15 22:54 UTC (permalink / raw)
  To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers

On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote:
> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
> >> I didn't really know that multiple qdiscs can be isolated using MQ so
> >> that each txq can be associated with a particular qdisc. Also we don't
> >> really have multiple interfaces...
> >>
> >> With this MQ solution we'll still need to assign transmit queues to
> >> different classes by doing some math on the bandwidth limit if I
> >> understand correctly, which seems to be less convenient compared with
> >> a solution purely within HTB.
> >>
> >> I assume that with this solution I can still share qdisc among
> >> multiple transmit queues - please let me know if this is not the case.
> >
> > Note that this MQ + HTB thing works well, unless you use a bonding
> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> > tokens between the slaves)
> >
> >
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
> >
> > bonding can not really be used as a true MQ device yet.
> >
> > I might send a patch to disable this 'bonding feature' if no slave sets
> > a queue_id.
> >
> >
> So there is no way of using this MQ solution when bonding interface is
> used, right? Then modifying HTB might be the only solution?

I probably can submit a bonding patch very soon if there is interest.

Modifying HTB is way more complicated :(

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-15 22:54         ` Eric Dumazet
@ 2016-04-15 23:05           ` Michael Ma
  2016-04-15 23:56             ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-15 23:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-15 15:54 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>> >
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>> >
>> > bonding can not really be used as a true MQ device yet.
>> >
>> > I might send a patch to disable this 'bonding feature' if no slave sets
>> > a queue_id.
>> >
>> >
>> So there is no way of using this MQ solution when bonding interface is
>> used, right? Then modifying HTB might be the only solution?
>
> I probably can submit a bonding patch very soon if there is interest.

Would definitely appreciate that. If you can share the patch it will
be helpful as well. Let me know if I can help with this...
>
> Modifying HTB is way more complicated :(
>
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-15 23:05           ` Michael Ma
@ 2016-04-15 23:56             ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2016-04-15 23:56 UTC (permalink / raw)
  To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers

On Fri, 2016-04-15 at 16:05 -0700, Michael Ma wrote:
> Would definitely appreciate that. If you can share the patch it will
> be helpful as well. Let me know if I can help with this...

Sure, here is what I am going to test :

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 941ec99cd3b6..f890ebe78629 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4003,11 +4003,17 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
 	 * skb_tx_hash and will put the skbs in the queue we expect on their
 	 * way down to the bonding driver.
 	 */
-	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+	struct bonding *bond = netdev_priv(dev);
+	u16 txq;
 
 	/* Save the original txq to restore before passing to the driver */
 	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
 
+	if (!bond->queue_id_selection_enabled)
+		return fallback(dev, skb);
+
+	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
 	if (unlikely(txq >= dev->real_num_tx_queues)) {
 		do {
 			txq -= dev->real_num_tx_queues;
@@ -4020,7 +4026,8 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
 {
 	struct bonding *bond = netdev_priv(dev);
 
-	if (bond_should_override_tx_queue(bond) &&
+	if (bond->queue_id_selection_enabled &&
+	    bond_should_override_tx_queue(bond) &&
 	    !bond_slave_override(bond, skb))
 		return NETDEV_TX_OK;
 
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 577e57cad1dc..1a8a5153f027 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -1270,6 +1270,7 @@ static int bond_option_ad_select_set(struct bonding *bond,
 static int bond_option_queue_id_set(struct bonding *bond,
 				    const struct bond_opt_value *newval)
 {
+	bool queue_id_selection_enabled = false;
 	struct slave *slave, *update_slave;
 	struct net_device *sdev;
 	struct list_head *iter;
@@ -1318,6 +1319,12 @@ static int bond_option_queue_id_set(struct bonding *bond,
 	/* Actually set the qids for the slave */
 	update_slave->queue_id = qid;
 
+	bond_for_each_slave(bond, slave, iter) {
+		if (update_slave->queue_id)
+			queue_id_selection_enabled = true;
+	}
+	bond->queue_id_selection_enabled = queue_id_selection_enabled;
+
 out:
 	return ret;
 
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 791800ddd6d9..29e29491c571 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -204,6 +204,7 @@ struct bonding {
 	struct   slave __rcu *primary_slave;
 	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
+	bool	 queue_id_selection_enabled;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
 			      struct slave *);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-03-31 23:41   ` Michael Ma
@ 2016-04-16  8:52     ` Andrew
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew @ 2016-04-16  8:52 UTC (permalink / raw)
  To: Michael Ma, Jesper Dangaard Brouer; +Cc: netdev

I think that it isn't a good solution - unless you can bind a
specified host (src/dst) to a specified txq. Usually traffic is spread
across txqs by a src+dst IP (or even IP:port) hash, which results in
traffic being spread among all the MQ queues on the device and in
wrong bandwidth limiting (N*bandwidth on multi-session loads like
p2p/server traffic)...

People say that the HFSC shaper has better performance, but I haven't
tested it.
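
(If you do want to try pinning a host to a single queue, one rough,
untested sketch is mqprio with a single-queue traffic class plus a
priority-setting rule - addresses, queue layout and rates here are
made up:

  tc qdisc add dev eth0 root handle 100: mqprio num_tc 2 \
      map 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 7@0 1@7 hw 0
  # queue 7 (class 100:8) carries only this host's traffic; shape it there
  tc qdisc add dev eth0 parent 100:8 handle 200: htb default 10
  tc class add dev eth0 parent 200: classid 200:10 htb rate 1gbit
  # force skb->priority 1 for that host so it always maps to tc 1 / queue 7
  iptables -t mangle -A POSTROUTING -d 192.0.2.10 -j CLASSIFY --set-class 0:1

but I haven't measured how that behaves under load.)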

01.04.2016 02:41, Michael Ma wrote:
> Thanks for the suggestion - I'll try the MQ solution out. It seems to
> be able to solve the problem well with the assumption that bandwidth
> can be statically partitioned.
>
> 2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:
>>
>>> I know this might be an old topic so bare with me – what we are facing
>>> is that applications are sending small packets using hundreds of
>>> threads so the contention on spin lock in __dev_xmit_skb increases the
>>> latency of dev_queue_xmit significantly. We’re building a network QoS
>>> solution to avoid interference of different applications using HTB.
>> Yes, as you have noticed with HTB there is a single qdisc lock, and
>> congestion obviously happens :-)
>>
>> It is possible with different tricks to make it scale.  I believe
>> Google is using a variant of HTB, and it scales for them.  They have
>> not open source their modifications to HTB (which likely also involves
>> a great deal of setup tricks).
>>
>> If your purpose it to limit traffic/bandwidth per "cloud" instance,
>> then you can just use another TC setup structure.  Like using MQ and
>> assigning a HTB per MQ queue (where the MQ queues are bound to each
>> CPU/HW queue)... But you have to figure out this setup yourself...
>>
>>
>>> But in this case when some applications send massive small packets in
>>> parallel, the application to be protected will get its throughput
>>> affected (because it’s doing synchronous network communication using
>>> multiple threads and throughput is sensitive to the increased latency)
>>>
>>> Here is the profiling from perf:
>>>
>>> -  67.57%   iperf  [kernel.kallsyms]     [k] _spin_lock
>>>    - 99.94% dev_queue_xmit
>>>    -  96.91% _spin_lock
>>>   - 2.62%  __qdisc_run
>>>    - 98.98% sch_direct_xmit
>>> - 99.98% _spin_lock
>>>
>>> As far as I understand the design of TC is to simplify locking schema
>>> and minimize the work in __qdisc_run so that throughput won’t be
>>> affected, especially with large packets. However if the scenario is
>>> that multiple classes in the queueing discipline only have the shaping
>>> limit, there isn’t really a necessary correlation between different
>>> classes. The only synchronization point should be when the packet is
>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>> the device. My question is – is it worth investing on avoiding the
>>> locking contention by partitioning the queue/lock so that this
>>> scenario is addressed with relatively smaller latency?
>> Yes, there is a lot go gain, but it is not easy ;-)
>>
>>> I must have oversimplified a lot of details since I’m not familiar
>>> with the TC implementation at this point – just want to get your input
>>> in terms of whether this is a worthwhile effort or there is something
>>> fundamental that I’m not aware of. If this is just a matter of quite
>>> some additional work, would also appreciate helping to outline the
>>> required work here.
>>>
>>> Also would appreciate if there is any information about the latest
>>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>> This article seems to be very low quality... spelling errors, only 5
>> pages, no real code, etc.
>>
>> --
>> Best regards,
>>    Jesper Dangaard Brouer
>>    MSc.CS, Principal Kernel Engineer at Red Hat
>>    Author of http://www.iptv-analyzer.org
>>    LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-08 14:19     ` Eric Dumazet
  2016-04-15 22:46       ` Michael Ma
@ 2016-04-20 21:24       ` Michael Ma
  2016-04-20 22:34         ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-20 21:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)

Actually MQ+HTB works well for small packets - a flow of 512-byte
packets can be throttled by HTB using one txq without being affected
by other flows of small packets. However, I found that with this
solution large packets (10k, for example) only achieve very limited
bandwidth. In my test I used MQ to assign one txq to an HTB that sets
the rate at 1Gbit/s; 512-byte packets can reach the ceiling rate using
30 threads, but sending 10k packets using 10 threads achieves only 10
Mbit/s with the same TC configuration. If I increase the burst and
cburst of HTB to some extremely large value (like 50MB) the ceiling
rate can be reached.
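
By "increase burst and cburst" I mean something along these lines
(device name and handles are just placeholders):

  tc class change dev eth0 parent 1: classid 1:10 htb \
      rate 1gbit burst 50mb cburst 50mb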

The strange thing is that I don't see this problem when using HTB as
the root, so the txq number seems to be a factor here - however, it's
really hard to understand why it would only affect larger packets. Is
this a known issue? Any suggestion on how to investigate the issue
further? Profiling shows that the CPU utilization is pretty low.

>
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-20 21:24       ` Michael Ma
@ 2016-04-20 22:34         ` Eric Dumazet
  2016-04-21  5:51           ` Michael Ma
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2016-04-20 22:34 UTC (permalink / raw)
  To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers

On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
> >> I didn't really know that multiple qdiscs can be isolated using MQ so
> >> that each txq can be associated with a particular qdisc. Also we don't
> >> really have multiple interfaces...
> >>
> >> With this MQ solution we'll still need to assign transmit queues to
> >> different classes by doing some math on the bandwidth limit if I
> >> understand correctly, which seems to be less convenient compared with
> >> a solution purely within HTB.
> >>
> >> I assume that with this solution I can still share qdisc among
> >> multiple transmit queues - please let me know if this is not the case.
> >
> > Note that this MQ + HTB thing works well, unless you use a bonding
> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> > tokens between the slaves)
> 
> Actually MQ+HTB works well for small packets - like flow of 512 byte
> packets can be throttled by HTB using one txq without being affected
> by other flows with small packets. However I found using this solution
> large packets (10k for example) will only achieve very limited
> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
> using 30 threads. But sending 10k packets using 10 threads has only 10
> Mbit/s with the same TC configuration. If I increase burst and cburst
> of HTB to some extreme large value (like 50MB) the ceiling rate can be
> hit.
> 
> The strange thing is that I don't see this problem when using HTB as
> the root. So txq number seems to be a factor here - however it's
> really hard to understand why would it only affect larger packets. Is
> this a known issue? Any suggestion on how to investigate the issue
> further? Profiling shows that the cpu utilization is pretty low.

You could try 

perf record -a -g -e skb:kfree_skb sleep 5
perf report

So that you see where the packets are dropped.

Chances are that your UDP socket's SO_SNDBUF is too big, and packets
are dropped at qdisc enqueue time instead of creating backpressure.
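
Another quick check (not a substitute for the perf data) is the qdisc
drop counters:

  tc -s qdisc show dev eth0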

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-20 22:34         ` Eric Dumazet
@ 2016-04-21  5:51           ` Michael Ma
  2016-04-21 12:41             ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-21  5:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>>
>> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> packets can be throttled by HTB using one txq without being affected
>> by other flows with small packets. However I found using this solution
>> large packets (10k for example) will only achieve very limited
>> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> using 30 threads. But sending 10k packets using 10 threads has only 10
>> Mbit/s with the same TC configuration. If I increase burst and cburst
>> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> hit.
>>
>> The strange thing is that I don't see this problem when using HTB as
>> the root. So txq number seems to be a factor here - however it's
>> really hard to understand why would it only affect larger packets. Is
>> this a known issue? Any suggestion on how to investigate the issue
>> further? Profiling shows that the cpu utilization is pretty low.
>
> You could try
>
> perf record -a -g -e skb:kfree_skb sleep 5
> perf report
>
> So that you see where the packets are dropped.
>
> Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
> dropped at qdisc enqueue time, instead of having backpressure.
>

Thanks for the hint - how should I read the perf report? Also, we're
using TCP sockets in this test - the TCP window size is set to 70kB.

-  35.88%  init     [kernel.kallsyms]  [k] intel_idle
     intel_idle
-  15.83%  strings  libc-2.5.so        [.] __GI___connect_internal
   - __GI___connect_internal
      - 50.00% get_mapping
           __nscd_get_map_ref
        50.00% __nscd_open_socket
-  13.19%  strings  libc-2.5.so        [.] __GI___libc_recvmsg
   - __GI___libc_recvmsg
      + 64.52% getifaddrs
      + 35.48% __check_pf
-  10.55%  strings  libc-2.5.so        [.] __sendto_nocancel
   - __sendto_nocancel
        100.00% 0
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-21  5:51           ` Michael Ma
@ 2016-04-21 12:41             ` Eric Dumazet
  2016-04-21 22:12               ` Michael Ma
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2016-04-21 12:41 UTC (permalink / raw)
  To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers

On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote:
> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so
> >> >> that each txq can be associated with a particular qdisc. Also we don't
> >> >> really have multiple interfaces...
> >> >>
> >> >> With this MQ solution we'll still need to assign transmit queues to
> >> >> different classes by doing some math on the bandwidth limit if I
> >> >> understand correctly, which seems to be less convenient compared with
> >> >> a solution purely within HTB.
> >> >>
> >> >> I assume that with this solution I can still share qdisc among
> >> >> multiple transmit queues - please let me know if this is not the case.
> >> >
> >> > Note that this MQ + HTB thing works well, unless you use a bonding
> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> >> > tokens between the slaves)
> >>
> >> Actually MQ+HTB works well for small packets - like flow of 512 byte
> >> packets can be throttled by HTB using one txq without being affected
> >> by other flows with small packets. However I found using this solution
> >> large packets (10k for example) will only achieve very limited
> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
> >> using 30 threads. But sending 10k packets using 10 threads has only 10
> >> Mbit/s with the same TC configuration. If I increase burst and cburst
> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be
> >> hit.
> >>
> >> The strange thing is that I don't see this problem when using HTB as
> >> the root. So txq number seems to be a factor here - however it's
> >> really hard to understand why would it only affect larger packets. Is
> >> this a known issue? Any suggestion on how to investigate the issue
> >> further? Profiling shows that the cpu utilization is pretty low.
> >
> > You could try
> >
> > perf record -a -g -e skb:kfree_skb sleep 5
> > perf report
> >
> > So that you see where the packets are dropped.
> >
> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
> > dropped at qdisc enqueue time, instead of having backpressure.
> >
> 
> Thanks for the hint - how should I read the perf report? Also we're
> using TCP socket in this testing - TCP window size is set to 70kB.

But how are you telling TCP to send 10k packets?

AFAIK you cannot: TCP happily aggregates packets in the write queue
(see the current MSG_EOR discussion).

I suspect a bug in your tc settings.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-21 12:41             ` Eric Dumazet
@ 2016-04-21 22:12               ` Michael Ma
  2016-04-25 17:29                 ` Michael Ma
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-21 22:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote:
>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> >> really have multiple interfaces...
>> >> >>
>> >> >> With this MQ solution we'll still need to assign transmit queues to
>> >> >> different classes by doing some math on the bandwidth limit if I
>> >> >> understand correctly, which seems to be less convenient compared with
>> >> >> a solution purely within HTB.
>> >> >>
>> >> >> I assume that with this solution I can still share qdisc among
>> >> >> multiple transmit queues - please let me know if this is not the case.
>> >> >
>> >> > Note that this MQ + HTB thing works well, unless you use a bonding
>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> >> > tokens between the slaves)
>> >>
>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> >> packets can be throttled by HTB using one txq without being affected
>> >> by other flows with small packets. However I found using this solution
>> >> large packets (10k for example) will only achieve very limited
>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> >> using 30 threads. But sending 10k packets using 10 threads has only 10
>> >> Mbit/s with the same TC configuration. If I increase burst and cburst
>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> >> hit.
>> >>
>> >> The strange thing is that I don't see this problem when using HTB as
>> >> the root. So txq number seems to be a factor here - however it's
>> >> really hard to understand why would it only affect larger packets. Is
>> >> this a known issue? Any suggestion on how to investigate the issue
>> >> further? Profiling shows that the cpu utilization is pretty low.
>> >
>> > You could try
>> >
>> > perf record -a -g -e skb:kfree_skb sleep 5
>> > perf report
>> >
>> > So that you see where the packets are dropped.
>> >
>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
>> > dropped at qdisc enqueue time, instead of having backpressure.
>> >
>>
>> Thanks for the hint - how should I read the perf report? Also we're
>> using TCP socket in this testing - TCP window size is set to 70kB.
>
> But how are you telling TCP to send 10k packets ?
>
We just write to the socket with a 10k buffer and wait for a response
from the server (using read()) before the next write. Using tcpdump I
can see the 10k write is actually sent as 3 packets
(7.3k/1.5k/1.3k).

> AFAIK you can not : TCP happily aggregates packets in write queue
> (see current MSG_EOR discussion)
>
> I suspect a bug in your tc settings.
>
>

Could you help check my tc settings?

sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0
queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0
sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10
sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit

I didn't set r2q/burst/cburst/mtu/mpu, so the default values should be used.
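
The effective defaults should be visible with something like:

  tc -s -d qdisc show dev eth0
  tc -s -d class show dev eth0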

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: qdisc spin lock
  2016-04-21 22:12               ` Michael Ma
@ 2016-04-25 17:29                 ` Michael Ma
  0 siblings, 0 replies; 21+ messages in thread
From: Michael Ma @ 2016-04-25 17:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-21 15:12 GMT-07:00 Michael Ma <make0818@gmail.com>:
> 2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote:
>>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>>> >> >> that each txq can be associated with a particular qdisc. Also we don't
>>> >> >> really have multiple interfaces...
>>> >> >>
>>> >> >> With this MQ solution we'll still need to assign transmit queues to
>>> >> >> different classes by doing some math on the bandwidth limit if I
>>> >> >> understand correctly, which seems to be less convenient compared with
>>> >> >> a solution purely within HTB.
>>> >> >>
>>> >> >> I assume that with this solution I can still share qdisc among
>>> >> >> multiple transmit queues - please let me know if this is not the case.
>>> >> >
>>> >> > Note that this MQ + HTB thing works well, unless you use a bonding
>>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>>> >> > tokens between the slaves)
>>> >>
>>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte
>>> >> packets can be throttled by HTB using one txq without being affected
>>> >> by other flows with small packets. However I found using this solution
>>> >> large packets (10k for example) will only achieve very limited
>>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>>> >> using 30 threads. But sending 10k packets using 10 threads has only 10
>>> >> Mbit/s with the same TC configuration. If I increase burst and cburst
>>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>>> >> hit.
>>> >>
>>> >> The strange thing is that I don't see this problem when using HTB as
>>> >> the root. So txq number seems to be a factor here - however it's
>>> >> really hard to understand why would it only affect larger packets. Is
>>> >> this a known issue? Any suggestion on how to investigate the issue
>>> >> further? Profiling shows that the cpu utilization is pretty low.
>>> >
>>> > You could try
>>> >
>>> > perf record -a -g -e skb:kfree_skb sleep 5
>>> > perf report
>>> >
>>> > So that you see where the packets are dropped.
>>> >
>>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
>>> > dropped at qdisc enqueue time, instead of having backpressure.
>>> >
>>>
>>> Thanks for the hint - how should I read the perf report? Also we're
>>> using TCP socket in this testing - TCP window size is set to 70kB.
>>
>> But how are you telling TCP to send 10k packets ?
>>
> We just write to the socket with 10k buffer and wait for a response
> from the server (using read()) before the next write. Using tcpdump I
> can see the 10k write is actually sent through 3 packets
> (7.3k/1.5k/1.3k).
>
>> AFAIK you can not : TCP happily aggregates packets in write queue
>> (see current MSG_EOR discussion)
>>
>> I suspect a bug in your tc settings.
>>
>>
>
> Could you help to check my tc setting?
>
> sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0
> queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0
> sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10
> sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit
>
> I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used.

Just to circle back on this - it seems there is sometimes a 200 ms
delay during the data push which stalls the sending:

01:34:44.046232 IP (tos 0x0, ttl  64, id 2863, offset 0, flags [DF],
proto: TCP (6), length: 8740) 10.101.197.75.59126 >
10.101.197.105.redwood-broker: . 250025:258713(8688) ack 1901 win 58
<nop,nop,timestamp 507571833 196626529>
01:34:44.046304 IP (tos 0x0, ttl  64, id 15420, offset 0, flags [DF],
proto: TCP (6), length: 52) 10.101.197.105.redwood-broker >
10.101.197.75.59126: ., cksum 0x187d (correct), 1901:1901(0) ack
258713 win 232 <nop,nop,timestamp 196626529 507571833>
01:34:44.247184 IP (tos 0x0, ttl  64, id 2869, offset 0, flags [DF],
proto: TCP (6), length: 1364) 10.101.197.75.59126 >
10.101.197.105.redwood-broker: P 258713:260025(1312) ack 1901 win 58
<nop,nop,timestamp 507571833 196626529>
01:34:44.247186 IP (tos 0x0, ttl  64, id 2870, offset 0, flags [DF],
proto: TCP (6), length: 1364) 10.101.197.75.59126 >
10.101.197.105.redwood-broker: P 258713:260025(1312) ack 1901 win 58
<nop,nop,timestamp 507572034 196626529>

At 44.046s there was an ack from the iperf server (10.101.197.105) for
a previously sent packet of size 8740; then, after exactly 200 ms
(44.247s above), two identical packets were pushed from the client
(10.101.197.75). It looks like some TCP timer is being triggered -
however, disabling Nagle or delayed ack doesn't help. So maybe TC
delayed the first packet and, for some reason, only after 200 ms was
the packet sent, together with the retransmitted one.

As I mentioned before, setting burst/cburst to 50MB eliminates this
problem. Also, setting the TCP receive window on the server side to
some value between 4k and 12k solves the issue - but from the tcpdump
this might just cause the packets to be segmented further so that they
are not gated as significantly by HTB. Using TCP window auto-scaling
doesn't help.

It all looks like HTB delays a packet when the rate limit is hit (I
did the computation manually and the timing matches), and instead of
the packet being sent by a TC timer (which should fire in much less
than 200 ms - 100 ms?), it is only sent when TCP decides to retransmit
the same packet.
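
To confirm whether those duplicates really are RTO retransmissions
rather than copies delayed inside the qdisc, I'll try watching the
socket retransmit counters while reproducing, e.g.:

  ss -ti dst 10.101.197.105
  nstat -az TcpRetransSegs

(assuming ss/nstat from iproute2 are available on that box).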

^ permalink raw reply	[flat|nested] 21+ messages in thread


Thread overview: 21+ messages
2016-03-30  7:20 qdisc spin lock Michael Ma
2016-03-31 19:18 ` Jesper Dangaard Brouer
2016-03-31 23:41   ` Michael Ma
2016-04-16  8:52     ` Andrew
2016-03-31 22:16 ` Cong Wang
2016-03-31 23:48   ` Michael Ma
2016-04-01  2:19     ` David Miller
2016-04-01 17:17       ` Michael Ma
2016-04-01  3:44     ` John Fastabend
2016-04-13 18:23       ` Michael Ma
2016-04-08 14:19     ` Eric Dumazet
2016-04-15 22:46       ` Michael Ma
2016-04-15 22:54         ` Eric Dumazet
2016-04-15 23:05           ` Michael Ma
2016-04-15 23:56             ` Eric Dumazet
2016-04-20 21:24       ` Michael Ma
2016-04-20 22:34         ` Eric Dumazet
2016-04-21  5:51           ` Michael Ma
2016-04-21 12:41             ` Eric Dumazet
2016-04-21 22:12               ` Michael Ma
2016-04-25 17:29                 ` Michael Ma
