* bond + tc regression ?
@ 2009-05-05 15:45 Vladimir Ivashchenko
  2009-05-05 16:25 ` Denys Fedoryschenko
  2009-05-05 16:31 ` Eric Dumazet
  0 siblings, 2 replies; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-05 15:45 UTC (permalink / raw)
  To: netdev

Hi,

I have a traffic policing setup running on Linux, serving about 800 mbps
of traffic. Due to the traffic growth I decided to employ network
interface bonding to scale beyond a single GigE.

The Sun X4150 server has 2xIntel E5450 QuadCore CPUs and a total of four
built-in e1000e interfaces, which I grouped into two bond interfaces.

With kernel 2.6.23.1, everything works fine, but the system locked up
after a few days.

With kernel 2.6.28.7/2.6.29.1, I get 10-20% packet loss. I get packet loss as
soon as I put a classful qdisc, even prio, without even having any
classes or filters. TC prio statistics report lots of drops, around 10k
per sec. With exactly the same setup on 2.6.23, the number of drops is
only 50 per sec.
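
For example, the drops start as soon as I do something as simple as:

tc qdisc add dev bond0 root handle 1: prio

and stop again the moment I remove the root qdisc ("tc qdisc del dev bond0 root").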

On both kernels, the system is running with at least 70% idle CPU.
The network interrupts are distributed across the cores.

I thought it was an e1000e driver issue, but tweaking the e1000e ring buffers
didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
I tried running on a different server with bnx cards, I tried disabling
NO_HZ and HRTICK, but still I have the same problem.

However, if I don't utilize bond, but just apply rules on normal ethX
interfaces, there is no packet loss with 2.6.28/29. 

So, the problem appears only with the 2.6.28/29 + bond + classful tc
combination.

Any ideas ?

-- 
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 15:45 bond + tc regression ? Vladimir Ivashchenko
@ 2009-05-05 16:25 ` Denys Fedoryschenko
  2009-05-05 16:31 ` Eric Dumazet
  1 sibling, 0 replies; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-05 16:25 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

Can you show an example of the rules you are putting in place?
I can probably spot the mistakes, give a correct example, and maybe explain
why it happens.


On Tuesday 05 May 2009 18:45:58 Vladimir Ivashchenko wrote:
> Hi,
>
> I have a traffic policing setup running on Linux, serving about 800 mbps
> of traffic. Due to the traffic growth I decided to employ network
> interface bonding to scale over a single GigE.
>
> The Sun X4150 server has 2xIntel E5450 QuadCore CPUs and a total of four
> built-in e1000e interfaces, which I grouped into two bond interfaces.
>
> With kernel 2.6.23.1, everything works fine, but the system locked up
> after a few days.
>
> With kernel 2.6.28.7/2.6.29.1, I get 10-20% packet loss. I get packet loss
> as soon as I put a classful qdisc, even prio, without even having any
> classes or filters. TC prio statistics report lots of drops, around 10k per
> sec. With exactly the same setup on 2.6.23, the number of drops is only 50
> per sec.
>
> On both kernels, the system is running with at least 70% idle CPU.
> The network interrupts are distributed accross the cores.
>
> I thought it was a e1000e driver issue, but tweaking e1000e ring buffers
> didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
> I tried running on a different server with bnx cards, I tried disabling
> NO_HZ and HRTICK, but still I have the same problem.
>
> However, if I don't utilize bond, but just apply rules on normal ethX
> interfaces, there is no packet loss with 2.6.28/29.
>
> So, the problem appears only when I use 2.6.28/29 + bond + classful tc
> combination.
>
> Any ideas ?



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 15:45 bond + tc regression ? Vladimir Ivashchenko
  2009-05-05 16:25 ` Denys Fedoryschenko
@ 2009-05-05 16:31 ` Eric Dumazet
  2009-05-05 17:41   ` Vladimir Ivashchenko
  1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2009-05-05 16:31 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

Vladimir Ivashchenko a écrit :
> Hi,
> 
> I have a traffic policing setup running on Linux, serving about 800 mbps
> of traffic. Due to the traffic growth I decided to employ network
> interface bonding to scale over a single GigE.
> 
> The Sun X4150 server has 2xIntel E5450 QuadCore CPUs and a total of four
> built-in e1000e interfaces, which I grouped into two bond interfaces.
> 
> With kernel 2.6.23.1, everything works fine, but the system locked up
> after a few days.
> 
> With kernel 2.6.28.7/2.6.29.1, I get 10-20% packet loss. I get packet loss as
> soon as I put a classful qdisc, even prio, without even having any
> classes or filters. TC prio statistics report lots of drops, around 10k
> per sec. With exactly the same setup on 2.6.23, the number of drops is
> only 50 per sec.
> 
> On both kernels, the system is running with at least 70% idle CPU.
> The network interrupts are distributed accross the cores.

You should not distribute interrupts, but bind a NIC to one CPU
> 
> I thought it was a e1000e driver issue, but tweaking e1000e ring buffers
> didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
> I tried running on a different server with bnx cards, I tried disabling
> NO_HZ and HRTICK, but still I have the same problem.
> 
> However, if I don't utilize bond, but just apply rules on normal ethX
> interfaces, there is no packet loss with 2.6.28/29. 
> 
> So, the problem appears only when I use 2.6.28/29 + bond + classful tc
> combination. 
> 
> Any ideas ?
> 

Yes, we need much more information :)
Is it a forwarding setup only ?

cat /proc/interrupts
cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1
tc -s -d qdisc
mpstat -P ALL 10
ifconfig -a

and so on ...



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 16:31 ` Eric Dumazet
@ 2009-05-05 17:41   ` Vladimir Ivashchenko
  2009-05-05 18:50     ` Eric Dumazet
  2009-05-06  6:10     ` Jarek Poplawski
  0 siblings, 2 replies; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-05 17:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

> > On both kernels, the system is running with at least 70% idle CPU.
> > The network interrupts are distributed accross the cores.
> 
> You should not distribute interrupts, but bound a NIC to one CPU

Kernels 2.6.28 and 2.6.29 do this by default, so I thought it was correct.
Are the defaults wrong?

I have tried with IRQs bound to one CPU per NIC. Same result.

> > I thought it was a e1000e driver issue, but tweaking e1000e ring buffers
> > didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
> > I tried running on a different server with bnx cards, I tried disabling
> > NO_HZ and HRTICK, but still I have the same problem.
> > 
> > However, if I don't utilize bond, but just apply rules on normal ethX
> > interfaces, there is no packet loss with 2.6.28/29. 
> > 
> > So, the problem appears only when I use 2.6.28/29 + bond + classful tc
> > combination. 
> > 
> > Any ideas ?
> > 
> 
> Yes, we need much more information :)
> Is it a forwarding setup only ?

Yes, the server is doing nothing else but forwarding, no iptables.

> cat /proc/interrupts

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:        130          0          0          0          0          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  3:          0          0          0          1          0          1          0          0   IO-APIC-edge
  4:          0          0          1          0          0          0          1          0   IO-APIC-edge
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          4          0          0          0          0          0          0          0   IO-APIC-edge      i8042
 14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ata_piix
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ata_piix
 17:      30901      31910      31446      30655      31618      30550      31543      30958   IO-APIC-fasteoi   aacraid
 20:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 21:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5, ahci
 22:     298387     297642     295508     294368     295533     295430     295275     296036   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
 23:      10868      10926      10980      10738      10939      10615      10761      10909   IO-APIC-fasteoi   uhci_hcd:usb3
 57: 1486251823 1486835830 1486677250 1487105983 1488000303 1485941815 1487728317 1486624997   PCI-MSI-edge      eth0
 58: 1510676329 1509708161 1510347202 1509969755 1508599471 1511220118 1509094578 1509727616   PCI-MSI-edge      eth1
 59: 1482578890 1483618556 1482963700 1483164528 1484561615 1482130645 1484116749 1483557717   PCI-MSI-edge      eth2
 60: 1507341647 1506685822 1506862759 1506612818 1505689367 1507559672 1505911622 1506940613   PCI-MSI-edge      eth3
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC: 1020533656 1020535165 1020533613 1020534967 1020535173 1020534409 1020534985 1020534220   Local timer interrupts
RES:      18605      21215      15957      18637      22429      19493      16649      15589   Rescheduling interrupts
CAL:        160        214        186        185        199        205        190        180   Function call interrupts
TLB:     259515     264126     309016     312222     263163     265601     306189     305430   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0

> tc -s -d qdisc

For the sake of testing, I just put "tc qdisc add dev $IFACE root handle 1: prio" and no filters at all.
I get the same with HTB, "tc qdisc add dev $IFACE root handle 1: htb default 99", and no subclasses.

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 13287736273644 bytes 1263672018 pkt (dropped 0, overlimits 0 requeues 2928480094)
 rate 0bit 0pps backlog 0b 0p requeues 2928480094
qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 40064376195000 bytes 1747026586 pkt (dropped 0, overlimits 0 requeues 463621814)
 rate 0bit 0pps backlog 0b 0p requeues 463621814
qdisc pfifo_fast 0: dev eth2 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 13350145517965 bytes 1350897201 pkt (dropped 0, overlimits 0 requeues 2930879507)
 rate 0bit 0pps backlog 0b 0p requeues 2930879507
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 40193456126884 bytes 1950653764 pkt (dropped 0, overlimits 0 requeues 465511120)
 rate 0bit 0pps backlog 0b 0p requeues 465511120
qdisc prio 1: dev bond0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 985164834 bytes 2720991 pkt (dropped 241834, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc prio 1: dev bond1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 2347118738 bytes 3089171 pkt (dropped 304601, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

** Drops on bond0/bond1 are increasing by approximately 5000 per second:

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 13287874353796 bytes 1264050808 pkt (dropped 0, overlimits 0 requeues 2928520779)
 rate 0bit 0pps backlog 0b 0p requeues 2928520779
qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 40064706826018 bytes 1747459793 pkt (dropped 0, overlimits 0 requeues 463669610)
 rate 0bit 0pps backlog 0b 0p requeues 463669610
qdisc pfifo_fast 0: dev eth2 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 13350283202695 bytes 1351277761 pkt (dropped 0, overlimits 0 requeues 2930918488)
 rate 0bit 0pps backlog 0b 0p requeues 2930918488
qdisc pfifo_fast 0: dev eth3 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 40193784868074 bytes 1951084029 pkt (dropped 0, overlimits 0 requeues 465558015)
 rate 0bit 0pps backlog 0b 0p requeues 465558015
qdisc prio 1: dev bond0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1260929539 bytes 3480340 pkt (dropped 311145, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc prio 1: dev bond1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 3006490946 bytes 3952643 pkt (dropped 396850, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

With the same setup on 2.6.23, drops increase by only 50/sec or so.

As soon as I do "tc qdisc del dev $IFACE root", packet loss stops.

> cat /proc/net/bonding/bond0

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 80
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 17
        Partner Key: 4
        Partner Mac Address: 00:19:e7:b2:07:80

Slave Interface: eth0
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:24:bd:e9:cc
Aggregator ID: 1

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:24:bd:e9:ce
Aggregator ID: 1

> cat /proc/net/bonding/bond1

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 80
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
        Aggregator ID: 2
        Number of ports: 2
        Actor Key: 17
        Partner Key: 5
        Partner Mac Address: 00:19:e7:b2:07:80

Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1b:24:bd:e9:cd
Aggregator ID: 2

Slave Interface: eth3
MII Status: up
Link Failure Count: 2
Permanent HW addr: 00:1b:24:bd:e9:cf
Aggregator ID: 2


> mpstat -P ALL 10

08:04:36 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:04:46 PM  all    0.00    0.00    0.01    0.00    0.00    1.05    0.00   98.94  70525.73
08:04:46 PM    0    0.00    0.00    0.00    0.00    0.00    0.70    0.00   99.30   7814.41
08:04:46 PM    1    0.00    0.00    0.00    0.00    0.00    2.10    0.00   97.90   7814.41
08:04:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.20    0.00   99.80   7814.41
08:04:46 PM    3    0.00    0.00    0.10    0.00    0.00    1.30    0.00   98.60   7814.51
08:04:46 PM    4    0.00    0.00    0.00    0.00    0.00    0.50    0.00   99.50   7814.41
08:04:46 PM    5    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7814.41
08:04:46 PM    6    0.00    0.00    0.00    0.00    0.00    0.60    0.00   99.40   7814.41
08:04:46 PM    7    0.00    0.00    0.10    0.00    0.00    0.90    0.00   99.00   7814.51
08:04:46 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00

08:04:46 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
08:04:56 PM  all    0.00    0.00    0.01    0.00    0.00    1.49    0.00   98.50  66429.30
08:04:56 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   7303.50
08:04:56 PM    1    0.00    0.00    0.00    0.00    0.00    1.60    0.00   98.40   7303.50
08:04:56 PM    2    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
08:04:56 PM    3    0.00    0.00    0.00    0.00    0.00    3.20    0.00   96.80   7303.40
08:04:56 PM    4    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7303.60
08:04:56 PM    5    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
08:04:56 PM    6    0.00    0.00    0.10    0.00    0.00    1.80    0.00   98.10   7303.50
08:04:56 PM    7    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
08:04:56 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00

> ifconfig -a

bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
          inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
          inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
          TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)

bond1     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
          inet addr:xxx.xxx.70.156  Bcast:xxx.xxx.70.159  Mask:255.255.255.248
          inet6 addr: fe80::21b:24ff:febd:e9cd/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:239471641 errors:0 dropped:344 overruns:0 frame:0
          TX packets:3704083902 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2488754745 (2.3 GiB)  TX bytes:2685275089 (2.5 GiB)

eth0      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2235085582 errors:0 dropped:353786 overruns:0 frame:0
          TX packets:1266449269 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3768096439 (3.5 GiB)  TX bytes:113363829 (108.1 MiB)
          Memory:fc6e0000-fc700000

eth1      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:4228974804 errors:0 dropped:344 overruns:0 frame:0
          TX packets:1750216649 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3350270261 (3.1 GiB)  TX bytes:3358220645 (3.1 GiB)
          Memory:fc6c0000-fc6e0000

eth2      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2495958020 errors:0 dropped:37464 overruns:0 frame:0
          TX packets:1353707165 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:442055526 (421.5 MiB)  TX bytes:2406943933 (2.2 GiB)
          Memory:fcde0000-fce00000

eth3      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:305464222 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1953867360 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3433479245 (3.1 GiB)  TX bytes:3622113909 (3.3 GiB)
          Memory:fcd80000-fcda0000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:53537 errors:0 dropped:0 overruns:0 frame:0
          TX packets:53537 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:431006433 (411.0 MiB)  TX bytes:431006433 (411.0 MiB)


NOTE: ifconfig drops on bond0/bond1 are *NOT* increasing. These drops are there from before.

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 17:41   ` Vladimir Ivashchenko
@ 2009-05-05 18:50     ` Eric Dumazet
  2009-05-05 23:50       ` Vladimir Ivashchenko
  2009-05-06  8:03       ` Ingo Molnar
  2009-05-06  6:10     ` Jarek Poplawski
  1 sibling, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2009-05-05 18:50 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

Vladimir Ivashchenko a écrit :
>>> On both kernels, the system is running with at least 70% idle CPU.
>>> The network interrupts are distributed accross the cores.
>> You should not distribute interrupts, but bound a NIC to one CPU
> 
> Kernels 2.6.28 and 2.6.29 do this by default, so I thought its correct.
> The defaults are wrong?

Yes they are, at least for forwarding setups.

> 
> I have tried with IRQs bound to one CPU per NIC. Same result.

Did you check with "grep eth /proc/interrupts" that your affinity settings
were indeed taken into account ?

You should use the same CPU for eth0 and eth2 (bond0),

and another CPU for eth1 and eth3 (bond1).

Check how your CPUs are laid out:

egrep 'physical id|core id|processor' /proc/cpuinfo

because you might want to experiment to find the best combination.
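
For example, with the IRQ numbers shown in your /proc/interrupts above (57-60),
binding the bond0 slaves to CPU0 and the bond1 slaves to CPU1 would be something like:

echo 1 > /proc/irq/57/smp_affinity   # eth0 -> CPU0
echo 1 > /proc/irq/59/smp_affinity   # eth2 -> CPU0
echo 2 > /proc/irq/58/smp_affinity   # eth1 -> CPU1
echo 2 > /proc/irq/60/smp_affinity   # eth3 -> CPU1

(smp_affinity is a hex CPU mask, so 1 means CPU0 and 2 means CPU1.)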


If you use 2.6.29, apply the following patch to get better system accounting,
to check whether your CPUs are saturated by hard/soft IRQs:

--- linux-2.6.29/kernel/sched.c.orig    2009-05-05 20:46:49.000000000 +0200
+++ linux-2.6.29/kernel/sched.c 2009-05-05 20:47:19.000000000 +0200
@@ -4290,7 +4290,7 @@

        if (user_tick)
                account_user_time(p, one_jiffy, one_jiffy_scaled);
-       else if (p != rq->idle)
+       else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
                account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
                                    one_jiffy_scaled);
        else



> 
>>> I thought it was a e1000e driver issue, but tweaking e1000e ring buffers
>>> didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
>>> I tried running on a different server with bnx cards, I tried disabling
>>> NO_HZ and HRTICK, but still I have the same problem.
>>>
>>> However, if I don't utilize bond, but just apply rules on normal ethX
>>> interfaces, there is no packet loss with 2.6.28/29. 
>>>
>>> So, the problem appears only when I use 2.6.28/29 + bond + classful tc
>>> combination. 
>>>
>>> Any ideas ?
>>>
>> Yes, we need much more information :)
>> Is it a forwarding setup only ?
> 
> Yes, the server is doing nothing else but forwarding, no iptables.
> 
>> cat /proc/interrupts
> 
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:        130          0          0          0          0          0          0          0   IO-APIC-edge      timer
>   1:          2          0          0          0          0          0          0          0   IO-APIC-edge      i8042
>   3:          0          0          0          1          0          1          0          0   IO-APIC-edge
>   4:          0          0          1          0          0          0          1          0   IO-APIC-edge
>   9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
>  12:          4          0          0          0          0          0          0          0   IO-APIC-edge      i8042
>  14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ata_piix
>  15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ata_piix
>  17:      30901      31910      31446      30655      31618      30550      31543      30958   IO-APIC-fasteoi   aacraid
>  20:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  21:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5, ahci
>  22:     298387     297642     295508     294368     295533     295430     295275     296036   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
>  23:      10868      10926      10980      10738      10939      10615      10761      10909   IO-APIC-fasteoi   uhci_hcd:usb3
>  57: 1486251823 1486835830 1486677250 1487105983 1488000303 1485941815 1487728317 1486624997   PCI-MSI-edge      eth0
>  58: 1510676329 1509708161 1510347202 1509969755 1508599471 1511220118 1509094578 1509727616   PCI-MSI-edge      eth1
>  59: 1482578890 1483618556 1482963700 1483164528 1484561615 1482130645 1484116749 1483557717   PCI-MSI-edge      eth2
>  60: 1507341647 1506685822 1506862759 1506612818 1505689367 1507559672 1505911622 1506940613   PCI-MSI-edge      eth3
> NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
> LOC: 1020533656 1020535165 1020533613 1020534967 1020535173 1020534409 1020534985 1020534220   Local timer interrupts
> RES:      18605      21215      15957      18637      22429      19493      16649      15589   Rescheduling interrupts
> CAL:        160        214        186        185        199        205        190        180   Function call interrupts
> TLB:     259515     264126     309016     312222     263163     265601     306189     305430   TLB shootdowns
> TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
> SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
> ERR:          0
> MIS:          0
> 
>> tc -s -d qdisc
> 
> For test sake, I just put "tc qdisc add dev $IFACE root handle 1: prio" and no filters at all. 
> I get the same with HTB "tc qdisc add dev $IFACE root handle 1: htb default 99" and no subclasses.
> 
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13287736273644 bytes 1263672018 pkt (dropped 0, overlimits 0 requeues 2928480094)
>  rate 0bit 0pps backlog 0b 0p requeues 2928480094
> qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40064376195000 bytes 1747026586 pkt (dropped 0, overlimits 0 requeues 463621814)
>  rate 0bit 0pps backlog 0b 0p requeues 463621814
> qdisc pfifo_fast 0: dev eth2 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13350145517965 bytes 1350897201 pkt (dropped 0, overlimits 0 requeues 2930879507)
>  rate 0bit 0pps backlog 0b 0p requeues 2930879507
> qdisc pfifo_fast 0: dev eth3 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40193456126884 bytes 1950653764 pkt (dropped 0, overlimits 0 requeues 465511120)
>  rate 0bit 0pps backlog 0b 0p requeues 465511120
> qdisc prio 1: dev bond0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 985164834 bytes 2720991 pkt (dropped 241834, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc prio 1: dev bond1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 2347118738 bytes 3089171 pkt (dropped 304601, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> 
> ** Drops on bond0/bond1 are increasing by approximately 5000 per second:
> 
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13287874353796 bytes 1264050808 pkt (dropped 0, overlimits 0 requeues 2928520779)
>  rate 0bit 0pps backlog 0b 0p requeues 2928520779
> qdisc pfifo_fast 0: dev eth1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40064706826018 bytes 1747459793 pkt (dropped 0, overlimits 0 requeues 463669610)
>  rate 0bit 0pps backlog 0b 0p requeues 463669610
> qdisc pfifo_fast 0: dev eth2 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13350283202695 bytes 1351277761 pkt (dropped 0, overlimits 0 requeues 2930918488)
>  rate 0bit 0pps backlog 0b 0p requeues 2930918488
> qdisc pfifo_fast 0: dev eth3 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40193784868074 bytes 1951084029 pkt (dropped 0, overlimits 0 requeues 465558015)
>  rate 0bit 0pps backlog 0b 0p requeues 465558015
> qdisc prio 1: dev bond0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 1260929539 bytes 3480340 pkt (dropped 311145, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc prio 1: dev bond1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 3006490946 bytes 3952643 pkt (dropped 396850, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> 
> With same setup on 2.6.23, drops are increasing only by 50/sec or so.
> 
> As soon as I do "tc qdisc del dev $IFACE root", packet loss stops.
> 
>> cat /proc/net/bonding/bond0
> 
> Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
> 
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 0
> 
> 802.3ad info
> LACP rate: slow
> Aggregator selection policy (ad_select): stable
> Active Aggregator Info:
>         Aggregator ID: 1
>         Number of ports: 2
>         Actor Key: 17
>         Partner Key: 4
>         Partner Mac Address: 00:19:e7:b2:07:80
> 
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:cc
> Aggregator ID: 1
> 
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:ce
> Aggregator ID: 1
> 
>> cat /proc/net/bonding/bond1
> 
> Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
> 
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 0
> 
> 802.3ad info
> LACP rate: slow
> Aggregator selection policy (ad_select): stable
> Active Aggregator Info:
>         Aggregator ID: 2
>         Number of ports: 2
>         Actor Key: 17
>         Partner Key: 5
>         Partner Mac Address: 00:19:e7:b2:07:80
> 
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:cd
> Aggregator ID: 2
> 
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 2
> Permanent HW addr: 00:1b:24:bd:e9:cf
> Aggregator ID: 2
> 
> 
>> mpstat -P ALL 10
> 
> 08:04:36 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:04:46 PM  all    0.00    0.00    0.01    0.00    0.00    1.05    0.00   98.94  70525.73
> 08:04:46 PM    0    0.00    0.00    0.00    0.00    0.00    0.70    0.00   99.30   7814.41
> 08:04:46 PM    1    0.00    0.00    0.00    0.00    0.00    2.10    0.00   97.90   7814.41
> 08:04:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.20    0.00   99.80   7814.41
> 08:04:46 PM    3    0.00    0.00    0.10    0.00    0.00    1.30    0.00   98.60   7814.51
> 08:04:46 PM    4    0.00    0.00    0.00    0.00    0.00    0.50    0.00   99.50   7814.41
> 08:04:46 PM    5    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7814.41
> 08:04:46 PM    6    0.00    0.00    0.00    0.00    0.00    0.60    0.00   99.40   7814.41
> 08:04:46 PM    7    0.00    0.00    0.10    0.00    0.00    0.90    0.00   99.00   7814.51
> 08:04:46 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 
> 08:04:46 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:04:56 PM  all    0.00    0.00    0.01    0.00    0.00    1.49    0.00   98.50  66429.30
> 08:04:56 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   7303.50
> 08:04:56 PM    1    0.00    0.00    0.00    0.00    0.00    1.60    0.00   98.40   7303.50
> 08:04:56 PM    2    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    3    0.00    0.00    0.00    0.00    0.00    3.20    0.00   96.80   7303.40
> 08:04:56 PM    4    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7303.60
> 08:04:56 PM    5    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    6    0.00    0.00    0.10    0.00    0.00    1.80    0.00   98.10   7303.50
> 08:04:56 PM    7    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
> 
>> ifconfig -a
> 
> bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
>           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
>           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
> 
> bond1     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           inet addr:xxx.xxx.70.156  Bcast:xxx.xxx.70.159  Mask:255.255.255.248
>           inet6 addr: fe80::21b:24ff:febd:e9cd/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:239471641 errors:0 dropped:344 overruns:0 frame:0
>           TX packets:3704083902 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2488754745 (2.3 GiB)  TX bytes:2685275089 (2.5 GiB)
> 
> eth0      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:2235085582 errors:0 dropped:353786 overruns:0 frame:0
>           TX packets:1266449269 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3768096439 (3.5 GiB)  TX bytes:113363829 (108.1 MiB)
>           Memory:fc6e0000-fc700000
> 
> eth1      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:4228974804 errors:0 dropped:344 overruns:0 frame:0
>           TX packets:1750216649 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3350270261 (3.1 GiB)  TX bytes:3358220645 (3.1 GiB)
>           Memory:fc6c0000-fc6e0000
> 
> eth2      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:2495958020 errors:0 dropped:37464 overruns:0 frame:0
>           TX packets:1353707165 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:442055526 (421.5 MiB)  TX bytes:2406943933 (2.2 GiB)
>           Memory:fcde0000-fce00000
> 
> eth3      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:305464222 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1953867360 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3433479245 (3.1 GiB)  TX bytes:3622113909 (3.3 GiB)
>           Memory:fcd80000-fcda0000
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:53537 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:53537 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:431006433 (411.0 MiB)  TX bytes:431006433 (411.0 MiB)
> 
> 
> NOTE: ifconfig drops on bond0/bond1 are *NOT* increasing. These drops are there from before.
> 



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 18:50     ` Eric Dumazet
@ 2009-05-05 23:50       ` Vladimir Ivashchenko
  2009-05-05 23:52         ` Stephen Hemminger
  2009-05-06  3:36         ` Eric Dumazet
  2009-05-06  8:03       ` Ingo Molnar
  1 sibling, 2 replies; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-05 23:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:

> > I have tried with IRQs bound to one CPU per NIC. Same result.
> 
> Did you check "grep eth /proc/interrupts" that your affinities setup 
> were indeed taken into account ?
> 
> You should use same CPU for eth0 and eth2 (bond0),
> 
> and another CPU for eth1 and eth3 (bond1)

OK, the best result is when I assign all IRQs to the same CPU: zero drops.

When I bind the slaves of each bond interface to the same CPU, I start to get
some drops, but far fewer than before. I didn't play with combinations.

My problem is that, after applying your accounting patch below, one of my
HTB servers reports only 30-40% idle CPU on one of the cores. That won't
last me very long; load balancing across cores is needed.

Is there at least a way to balance individual NICs on a per-core basis?

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 23:50       ` Vladimir Ivashchenko
@ 2009-05-05 23:52         ` Stephen Hemminger
  2009-05-06  3:36         ` Eric Dumazet
  1 sibling, 0 replies; 27+ messages in thread
From: Stephen Hemminger @ 2009-05-05 23:52 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Eric Dumazet, netdev

On Wed, 6 May 2009 02:50:08 +0300
Vladimir Ivashchenko <hazard@francoudi.com> wrote:

> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
> 
> > > I have tried with IRQs bound to one CPU per NIC. Same result.
> > 
> > Did you check "grep eth /proc/interrupts" that your affinities setup 
> > were indeed taken into account ?
> > 
> > You should use same CPU for eth0 and eth2 (bond0),
> > 
> > and another CPU for eth1 and eth3 (bond1)
> 
> Ok, the best result is when assign all IRQs to the same CPU. Zero drops.
> 
> When I bind slaves of bond interfaces to the same CPU, I start to get 
> some drops, but much less than before. I didn't play with combinations.
> 
> My problem is, after applying your accounting patch below, one of my 
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't 
> take me for very long, load balancing across cores is needed.
> 
> Is there any way at least to balance individual NICs on per core basis?
> 

The user-level irqbalance program is a good place to start:
  http://www.irqbalance.org/
But it doesn't yet know how to handle multi-queue devices, and it doesn't
seem to handle NUMA (like SMP Nehalem) perfectly.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 23:50       ` Vladimir Ivashchenko
  2009-05-05 23:52         ` Stephen Hemminger
@ 2009-05-06  3:36         ` Eric Dumazet
  2009-05-06 10:28           ` Vladimir Ivashchenko
  2009-05-06 18:45           ` Vladimir Ivashchenko
  1 sibling, 2 replies; 27+ messages in thread
From: Eric Dumazet @ 2009-05-06  3:36 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

Vladimir Ivashchenko a écrit :
> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
> 
>>> I have tried with IRQs bound to one CPU per NIC. Same result.
>> Did you check "grep eth /proc/interrupts" that your affinities setup 
>> were indeed taken into account ?
>>
>> You should use same CPU for eth0 and eth2 (bond0),
>>
>> and another CPU for eth1 and eth3 (bond1)
> 
> Ok, the best result is when assign all IRQs to the same CPU. Zero drops.
> 
> When I bind slaves of bond interfaces to the same CPU, I start to get 
> some drops, but much less than before. I didn't play with combinations.
> 
> My problem is, after applying your accounting patch below, one of my 
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't 
> take me for very long, load balancing across cores is needed.
> 
> Is there any way at least to balance individual NICs on per core basis?
> 

The problem with this setup is that you have four NICs but two logical devices (bond0
& bond1) and a central HTB thing. This essentially makes flows go through the same
locks (some rwlocks guarding the bonding driver, and others guarding HTB structures).

Also, when a CPU receives a frame on ethX, it has to forward it on ethY, and
another lock guards access to the TX queue of the ethY device. If another CPU receives
a frame on ethZ and wants to forward it to ethY, that CPU will need the
same locks and everything slows down.

I am pretty sure you could get good results choosing two CPUs sharing the same L2
cache. The L2 on your CPU is 6MB. Another point would be to carefully choose the size
of the RX rings on the ethX devices. You could try to *reduce* them so that the number
of in-flight skbs is small enough that everything fits in this 6MB cache.
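
A rough sketch of that last point, with ring sizes that are only a starting guess:

ethtool -g eth0          # show the current RX/TX ring sizes
ethtool -G eth0 rx 256   # shrink the RX ring; repeat for eth1-eth3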

The problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
attached to one central memory bank won't increase RAM bandwidth; it reduces it.

And making several cores compete for locks in this RAM only slows down processing.

The only choice we have is to change bonding so that the driver uses RCU instead
of rwlocks, but that is probably a complex task. Multiple CPUs accessing
bonding structures could then share memory structures without dirtying them
and ping-ponging cache lines.

Ah, I forgot about one patch that could help your setup too (if you use more than one
CPU on NIC IRQs, of course), queued for 2.6.31:

(commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0)

You could post oprofile results to help us find other hot spots.


[PATCH] net: netif_tx_queue_stopped too expensive

netif_tx_queue_stopped(txq) is most of the time false.

Yet its cost is very expensive on SMP.

static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
{
	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}

I saw this on oprofile hunting and bnx2 driver bnx2_tx_int().

We probably should split "struct netdev_queue" in two parts, one
being read mostly.

__netif_tx_lock() touches _xmit_lock & xmit_lock_owner, these
deserve a separate cache line.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..1caaebb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -447,12 +447,18 @@ enum netdev_queue_state_t
 };
 
 struct netdev_queue {
+/*
+ * read mostly part
+ */
 	struct net_device	*dev;
 	struct Qdisc		*qdisc;
 	unsigned long		state;
-	spinlock_t		_xmit_lock;
-	int			xmit_lock_owner;
 	struct Qdisc		*qdisc_sleeping;
+/*
+ * write mostly part
+ */
+	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
+	int			xmit_lock_owner;
 } ____cacheline_aligned_in_smp;
 
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 17:41   ` Vladimir Ivashchenko
  2009-05-05 18:50     ` Eric Dumazet
@ 2009-05-06  6:10     ` Jarek Poplawski
  2009-05-06 10:36       ` Vladimir Ivashchenko
  1 sibling, 1 reply; 27+ messages in thread
From: Jarek Poplawski @ 2009-05-06  6:10 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Eric Dumazet, netdev

On 05-05-2009 19:41, Vladimir Ivashchenko wrote:
>>> On both kernels, the system is running with at least 70% idle CPU.
>>> The network interrupts are distributed accross the cores.
>> You should not distribute interrupts, but bound a NIC to one CPU
> 
> Kernels 2.6.28 and 2.6.29 do this by default, so I thought its correct.
> The defaults are wrong?
> 
...
>> ifconfig -a
> 
> bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
>           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
>           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)

Could you try e.g.: ifconfig bond0 txqueuelen 1000
before tc qdisc add?

Jarek P.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-05 18:50     ` Eric Dumazet
  2009-05-05 23:50       ` Vladimir Ivashchenko
@ 2009-05-06  8:03       ` Ingo Molnar
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2009-05-06  8:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vladimir Ivashchenko, netdev


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Vladimir Ivashchenko a écrit :
> >>> On both kernels, the system is running with at least 70% idle CPU.
> >>> The network interrupts are distributed accross the cores.
> >> You should not distribute interrupts, but bound a NIC to one CPU
> > 
> > Kernels 2.6.28 and 2.6.29 do this by default, so I thought its correct.
> > The defaults are wrong?
> 
> Yes they are, at least for forwarding setups.
> 
> > 
> > I have tried with IRQs bound to one CPU per NIC. Same result.
> 
> Did you check "grep eth /proc/interrupts" that your affinities setup 
> were indeed taken into account ?
> 
> You should use same CPU for eth0 and eth2 (bond0),
> 
> and another CPU for eth1 and eth3 (bond1)
> 
> check how your cpus are setup 
> 
> egrep 'physical id|core id|processor' /proc/cpuinfo
> 
> Because you might play and find best combo
> 
> 
> If you use 2.6.29, apply following patch to get better system accounting,
> to check if your cpu are saturated or not by hard/soft irqs
> 
> --- linux-2.6.29/kernel/sched.c.orig    2009-05-05 20:46:49.000000000 +0200
> +++ linux-2.6.29/kernel/sched.c 2009-05-05 20:47:19.000000000 +0200
> @@ -4290,7 +4290,7 @@
> 
>         if (user_tick)
>                 account_user_time(p, one_jiffy, one_jiffy_scaled);
> -       else if (p != rq->idle)
> +       else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
>                 account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
>                                     one_jiffy_scaled);
>         else

Note, your scheduler fix is upstream now in Linus's tree, as:

  f5f293a: sched: account system time properly

"git cherry-pick f5f293a" will apply it to a .29 basis.

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06  3:36         ` Eric Dumazet
@ 2009-05-06 10:28           ` Vladimir Ivashchenko
  2009-05-06 10:41             ` Eric Dumazet
  2009-05-06 18:45           ` Vladimir Ivashchenko
  1 sibling, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-06 10:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, May 06, 2009 at 05:36:08AM +0200, Eric Dumazet wrote:

> > Is there any way at least to balance individual NICs on per core basis?
> > 
> 
> Problem of this setup is you have four NICS, but two logical devices (bond0
> & bond1) and a central HTB thing. This essentialy makes flows go through the same
> locks (some rwlocks guarding bonding driver, and others guarding HTB structures).
> 
> Also when a cpu receives a frame on ethX, it has to forward it on ethY, and
> another lock guards access to TX queue of ethY device. If another cpus receives
> a frame on ethZ and want to forward it to ethY device, this other cpu will
> need same locks and everything slowdown.
> 
> I am pretty sure you could get good results choosing two cpus sharing same L2
> cache. L2 on your cpu is 6MB. Another point would be to carefuly choose size
> of RX rings on ethX devices. You could try to *reduce* them so that number
> of inflight skb is small enough that everything fits in this 6MB cache.
> 
> Problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
> attached to one central memory bank wont increase ram bandwidth, but reduce it.

Thanks for the detailed explanation.

On the particular server I reported, I worked around the problem by getting rid of classes 
and switching to ingress policers.

However, I have one central box doing HTB, with a small number of classes but 850 mbps of
traffic. The CPU is a dual-core 5160 @ 3 GHz. With 2.6.29 + bond I'm experiencing strange problems
with HTB: under high load, borrowing doesn't seem to work properly. This box has two
BNX2 and two E1000 NICs, and for some reason I cannot force the BNX2 IRQ to stick to a single CPU -
even though I put only one CPU into smp_affinity, it keeps balancing on both. So I cannot
figure out whether it's related to IRQ balancing or not.

[root@tshape3 tshaper]# cat /proc/irq/63/smp_affinity
01
[root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
 63:   44610754   95469129   PCI-MSI-edge      eth0
[root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
 63:   44614125   95472512   PCI-MSI-edge      eth0

lspci -v:

03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
        Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 63
        Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
        [virtual] Expansion ROM at 88200000 [disabled] [size=2K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data <?>
        Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Kernel driver in use: bnx2
        Kernel modules: bnx2


Any ideas on how to force it onto a single CPU ?

Thanks for the new patch, I will try it and let you know.

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06  6:10     ` Jarek Poplawski
@ 2009-05-06 10:36       ` Vladimir Ivashchenko
  2009-05-06 10:48         ` Jarek Poplawski
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-06 10:36 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Eric Dumazet, netdev

On Wed, May 06, 2009 at 06:10:10AM +0000, Jarek Poplawski wrote:

> >> ifconfig -a
> > 
> > bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
> >           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
> >           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
> >           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
> >           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
> >           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
> 
> Could you try e.g.: ifconfig bond0 txqueuelen 1000
> before tc qdisc add?

The drops on ifconfig are not increasing - these numbers are there from some tests made
before.

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 10:28           ` Vladimir Ivashchenko
@ 2009-05-06 10:41             ` Eric Dumazet
  2009-05-06 10:49               ` Denys Fedoryschenko
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2009-05-06 10:41 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

Vladimir Ivashchenko a écrit :
> On Wed, May 06, 2009 at 05:36:08AM +0200, Eric Dumazet wrote:
> 
>>> Is there any way at least to balance individual NICs on per core basis?
>>>
>> Problem of this setup is you have four NICS, but two logical devices (bond0
>> & bond1) and a central HTB thing. This essentialy makes flows go through the same
>> locks (some rwlocks guarding bonding driver, and others guarding HTB structures).
>>
>> Also when a cpu receives a frame on ethX, it has to forward it on ethY, and
>> another lock guards access to TX queue of ethY device. If another cpus receives
>> a frame on ethZ and want to forward it to ethY device, this other cpu will
>> need same locks and everything slowdown.
>>
>> I am pretty sure you could get good results choosing two cpus sharing same L2
>> cache. L2 on your cpu is 6MB. Another point would be to carefuly choose size
>> of RX rings on ethX devices. You could try to *reduce* them so that number
>> of inflight skb is small enough that everything fits in this 6MB cache.
>>
>> Problem is not really CPU power, but RAM bandwidth. Having two cores instead of one
>> attached to one central memory bank wont increase ram bandwidth, but reduce it.
> 
> Thanks for the detailed explanation.
> 
> On the particular server I reported, I worked around the problem by getting rid of classes 
> and switching to ingress policers.
> 
> However, I have one central box doing HTB, small amount of classes, but 850 mbps of
> traffic. The CPU is dual-core 5160 @ 3 Ghz. With 2.6.29 + bond I'm experiencing strange problems 
> with HTB, under high load borrowing doesn't seem to work properly. This box has two 
> BNX2 and two E1000 NICs, and for some reason I cannot force BNX2 to sit on a single IRQ -
> even though I put only one CPU into smp_affinity, it keeps balancing on both. So I cannot
> figure out if its related to IRQ balancing or not.
> 
> [root@tshape3 tshaper]# cat /proc/irq/63/smp_affinity
> 01
> [root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
>  63:   44610754   95469129   PCI-MSI-edge      eth0
> [root@tshape3 tshaper]# cat /proc/interrupts | grep eth0
>  63:   44614125   95472512   PCI-MSI-edge      eth0
> 
> lspci -v:
> 
> 03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
>         Subsystem: Hewlett-Packard Company NC373i Integrated Multifunction Gigabit Server Adapter
>         Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 63
>         Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
>         [virtual] Expansion ROM at 88200000 [disabled] [size=2K]
>         Capabilities: [40] PCI-X non-bridge device
>         Capabilities: [48] Power Management version 2
>         Capabilities: [50] Vital Product Data <?>
>         Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
>         Kernel driver in use: bnx2
>         Kernel modules: bnx2
> 
> 
> Any ideas on how to force it on a single CPU ?
> 
> Thanks for the new patch, I will try it and let you know.
> 

Yes, it's doable but tricky with bnx2; this is a known problem on recent kernels as well.


You must do, for example (to bind to CPU 0):

echo 1 >/proc/irq/default_smp_affinity

ifconfig eth1 down
# IRQ of eth1 handled by CPU0 only
echo 1 >/proc/irq/34/smp_affinity
ifconfig eth1 up

ifconfig eth0 down
# IRQ of eth0 handled by CPU0 only
echo 1 >/proc/irq/36/smp_affinity
ifconfig eth0 up
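
You can then verify that the binding took effect, e.g.:

grep eth /proc/interrupts   # only the chosen CPU's counters should keep increasing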


One more thing to consider is a BIOS option you might have, labeled "Adjacent Sector Prefetch".

This basically tells your CPU to use 128-byte cache lines instead of 64-byte ones.

In your forwarding workload, I believe this extra prefetch can slow down your machine.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 10:36       ` Vladimir Ivashchenko
@ 2009-05-06 10:48         ` Jarek Poplawski
  2009-05-06 13:11           ` Vladimir Ivashchenko
  0 siblings, 1 reply; 27+ messages in thread
From: Jarek Poplawski @ 2009-05-06 10:48 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Eric Dumazet, netdev

On Wed, May 06, 2009 at 01:36:16PM +0300, Vladimir Ivashchenko wrote:
> On Wed, May 06, 2009 at 06:10:10AM +0000, Jarek Poplawski wrote:
> 
> > >> ifconfig -a
> > > 
> > > bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
> > >           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
> > >           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
> > >           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
> > >           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
> > >           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
> > 
> > Could you try e.g.: ifconfig bond0 txqueuelen 1000
> > before tc qdisc add?
> 
> The drops on ifconfig are not increasing - these numbers are there from some tests made
> before.
> 

I'm not sure what you mean. IMHO you aren't using the qdiscs properly,
so any TX problems end in drops. (Older kernel versions could mask
this problem with requeuing.)

Jarek P.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 10:41             ` Eric Dumazet
@ 2009-05-06 10:49               ` Denys Fedoryschenko
  0 siblings, 0 replies; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-06 10:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vladimir Ivashchenko, netdev

On Wednesday 06 May 2009 13:41:25 Eric Dumazet wrote:
> You must do for example (to bind on CPU 0)
>
> echo 1 >/proc/irq/default_smp_affinity
>
> ifconfig eth1 down
> # IRQ of eth1 handled by CPU0 only
> echo 1 >/proc/irq/34/smp_affinity
> ifconfig eth1 up
>
> ifconfig eth0 down
> # IRQ of eth0 handled by CPU0 only
> echo 1 >/proc/irq/36/smp_affinity
> ifconfig eth0 up
I think it's better to use some method via ethtool that will cause a reset.
When you bring the interface down you will lose the default route - beware of that.
>
>
> One thing to consider too is the BIOS option you might have, labeled
> "Adjacent Sector Prefetch"
>
> This basically tells your cpu to use 128 bytes cache lines, instead of 64
>
> In your forwarding worload, I believe this extra prefetch can slowdown your
> machine.
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 10:48         ` Jarek Poplawski
@ 2009-05-06 13:11           ` Vladimir Ivashchenko
  2009-05-06 13:31             ` Patrick McHardy
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-06 13:11 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Eric Dumazet, netdev

On Wed, May 06, 2009 at 10:48:08AM +0000, Jarek Poplawski wrote:

> > > >> ifconfig -a
> > > > 
> > > > bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
> > > >           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
> > > >           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
> > > >           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
> > > >           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
> > > >           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
> > > >           collisions:0 txqueuelen:0
> > > >           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
> > > 
> > > Could you try e.g.: ifconfig bond0 txqueuelen 1000
> > > before tc qdisc add?
> > 
> > The drops on ifconfig are not increasing - these numbers are there from some tests made
> > before.
> > 
> 
> I'm not sure what do you mean? IMHO you don't use qdiscs properly,
> so any TX problems end with drops. (Older kernel versions could mask
> this problem with requeuing.)

Apologies, my bad, I misread what you wrote.

txqueuelen 1000 fixes the qdisc drops. I didn't notice that bond interfaces have it set to 0 by default.

As suggested by Eric earlier, the drops also disappear if I bind each NIC to a single CPU. That was the 
default on older kernels and perhaps that's why the issue came up only on 2.6.28.

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 13:11           ` Vladimir Ivashchenko
@ 2009-05-06 13:31             ` Patrick McHardy
  0 siblings, 0 replies; 27+ messages in thread
From: Patrick McHardy @ 2009-05-06 13:31 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Jarek Poplawski, Eric Dumazet, netdev

Vladimir Ivashchenko wrote:
> On Wed, May 06, 2009 at 10:48:08AM +0000, Jarek Poplawski wrote:
> 
>>>>>> ifconfig -a
>>>>> bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>>>>>           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
>>>>>           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
>>>>>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>>>>>           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
>>>>>           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
>>>>>           collisions:0 txqueuelen:0
>>>>>           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
>>>> Could you try e.g.: ifconfig bond0 txqueuelen 1000
>>>> before tc qdisc add?
>>> The drops on ifconfig are not increasing - these numbers are there from some tests made
>>> before.
>>>
>> I'm not sure what do you mean? IMHO you don't use qdiscs properly,
>> so any TX problems end with drops. (Older kernel versions could mask
>> this problem with requeuing.)
> 
> Apologies, my bad, I misread what you wrote.
> 
> txqueuelen 1000 fixes the qdisc drops. I didn't notice that bond interfaces have it set to 0 by default.

The fifos use a queue length of 1 when the tx_queue_len is zero (which
was added as a workaround for similar problems a long time ago). Perhaps
we should instead refuse these broken configurations or at least print
a warning.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06  3:36         ` Eric Dumazet
  2009-05-06 10:28           ` Vladimir Ivashchenko
@ 2009-05-06 18:45           ` Vladimir Ivashchenko
  2009-05-06 19:30             ` Denys Fedoryschenko
  1 sibling, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-06 18:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


On Wed, 2009-05-06 at 05:36 +0200, Eric Dumazet wrote:

> Ah, I forgot about one patch that could help your setup too (if using more than one
> cpu on NIC irqs of course), queued for 2.6.31

I have tried the patch. Didn't make a noticeable difference. Under 850
mbps HTB+sfq load, 2.6.29.1, four NICs / two bond ifaces, IRQ balancing,
the dual-core server has only 25% idle on each CPU.

What's interesting, the same 850mbps load, identical machine, but with
only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
2.5x overhead.

> (commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0)
> 
> You could post oprofile results to help us finding other hot spots.
> 
> 
> [PATCH] net: netif_tx_queue_stopped too expensive
> 
> netif_tx_queue_stopped(txq) is most of the time false.
> 
> Yet its cost is very expensive on SMP.
> 
> static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
> {
> 	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
> }
> 
> I saw this on oprofile hunting and bnx2 driver bnx2_tx_int().
> 
> We probably should split "struct netdev_queue" in two parts, one
> being read mostly.
> 
> __netif_tx_lock() touches _xmit_lock & xmit_lock_owner, these
> deserve a separate cache line.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 2e7783f..1caaebb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -447,12 +447,18 @@ enum netdev_queue_state_t
>  };
>  
>  struct netdev_queue {
> +/*
> + * read mostly part
> + */
>  	struct net_device	*dev;
>  	struct Qdisc		*qdisc;
>  	unsigned long		state;
> -	spinlock_t		_xmit_lock;
> -	int			xmit_lock_owner;
>  	struct Qdisc		*qdisc_sleeping;
> +/*
> + * write mostly part
> + */
> +	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
> +	int			xmit_lock_owner;
>  } ____cacheline_aligned_in_smp;
>  
> 
-- 
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 18:45           ` Vladimir Ivashchenko
@ 2009-05-06 19:30             ` Denys Fedoryschenko
  2009-05-06 20:47               ` Vladimir Ivashchenko
  0 siblings, 1 reply; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-06 19:30 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Eric Dumazet, netdev

On Wednesday 06 May 2009 21:45:18 Vladimir Ivashchenko wrote:
> On Wed, 2009-05-06 at 05:36 +0200, Eric Dumazet wrote:
> > Ah, I forgot about one patch that could help your setup too (if using
> > more than one cpu on NIC irqs of course), queued for 2.6.31
>
> I have tried the patch. Didn't make a noticeable difference. Under 850
> mbps HTB+sfq load, 2.6.29.1, four NICs / two bond ifaces, IRQ balancing,
> the dual-core server has only 25% idle on each CPU.
>
> What's interesting, the same 850mbps load, identical machine, but with
> only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> 2.5x overhead.

Probably oprofile can shed some light on this.
In my own experience IRQ balancing hurt performance a lot, because of cache
misses.
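
For the record, the usual oprofile sequence is roughly this (a sketch; the
vmlinux path is only an example, point it at the uncompressed image of the
kernel you actually booted):

opcontrol --init
opcontrol --vmlinux=/usr/src/linux/vmlinux
opcontrol --start
# ... let the box forward traffic for a minute or two ...
opcontrol --shutdown
opreport -l | head -40             # top symbols burning the CPU cycles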


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 19:30             ` Denys Fedoryschenko
@ 2009-05-06 20:47               ` Vladimir Ivashchenko
  2009-05-06 21:46                 ` Denys Fedoryschenko
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-06 20:47 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: netdev

On Wed, May 06, 2009 at 10:30:04PM +0300, Denys Fedoryschenko wrote:

> > What's interesting, the same 850mbps load, identical machine, but with
> > only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> > 2.5x overhead.
> 
> Probably oprofile can sched some light on this.
> On my own experience IRQ balancing hurt performance a lot, because of cache 
> misses.

This is a dual-core machine, isn't cache shared between the cores?

Without IRQ balancing, one of the cores goes around 10% idle and HTB doesn't do
its job properly. Actually, in my experience HTB stops working properly after
idle goes below 35%.

I'll try gathering some stats using oprofile.

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 20:47               ` Vladimir Ivashchenko
@ 2009-05-06 21:46                 ` Denys Fedoryschenko
  2009-05-08 20:46                   ` Vladimir Ivashchenko
  0 siblings, 1 reply; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-06 21:46 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

On Wednesday 06 May 2009 23:47:59 Vladimir Ivashchenko wrote:
> On Wed, May 06, 2009 at 10:30:04PM +0300, Denys Fedoryschenko wrote:
> > > What's interesting, the same 850mbps load, identical machine, but with
> > > only two NICs and no bond, HTB+esfq, kernel 2.6.21.2 => 60% CPU idle.
> > > 2.5x overhead.
> >
> > Probably oprofile can sched some light on this.
> > On my own experience IRQ balancing hurt performance a lot, because of
> > cache misses.
>
> This is a dual-core machine, isn't cache shared between the cores?
>
> Without IRQ balancing, one of the cores goes around 10% idle and HTB
> doesn't do its job properly. Actually, in my experience HTB stops working
> properly after idle goes below 35%.
It seems they should. No idea, more experienced guys should know more.

Can you please show me
cat /proc/net/psched
If high-resolution timers are working, try adding, as the first line of the HTB script,

HZ=1000
to set the environment variable. If the clock resolution is high, the burst
calculation goes crazy at high speeds.
Maybe it will help.
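
For example, the top of the script would then look roughly like this (a
sketch; as far as I remember tc derives the default HTB burst from rate/HZ
plus the MTU, so the HZ value it picks up matters a lot at these rates):

#!/bin/sh
export HZ=1000               # force the HZ that tc uses for burst calculation
cat /proc/net/psched         # see what timer resolution the kernel reports
tc qdisc add dev bond0 root handle 1: htb
# ... the rest of the HTB classes ...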

Also, without irq balance, did you try to assign each interface to a CPU via
smp_affinity? (/proc/irq/NN/smp_affinity)

And I still think the best thing is oprofile. It can show the "hot" places in
the code, i.e. who is spending the CPU cycles.

>
> I'll try gathering some stats using oprofile.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-06 21:46                 ` Denys Fedoryschenko
@ 2009-05-08 20:46                   ` Vladimir Ivashchenko
  2009-05-08 21:05                     ` Denys Fedoryschenko
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-08 20:46 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: netdev


> > Without IRQ balancing, one of the cores goes around 10% idle and HTB
> > doesn't do its job properly. Actually, in my experience HTB stops working
> > properly after idle goes below 35%.
> It seems they should. No idea, more experienced guys should know more.
> 
> Can you show me please
> cat /proc/net/psched
> If it is highres working, try to add in HTB script, first line
> 
> HZ=1000
> to set environment variable. Because if clock resolution high, burst 
> calculation going crazy on high speeds.
> Maybe it will help.

Wow, instead of 98425b burst, it's calculating 970203b. 

Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
to 1000 Hz and the burst is calculated correctly, for some reason HTB on
2.6.29 is still worse at rate control than 2.6.21.

With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
With 2.6.29, same ceil/burst -> actual rate 890 mbits.

Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, actual rate
reported by htb goes ballistic and stays at 1100mbits. Then it drops
back to expected value after a minute or so.

> Also without irq balance, did you try to assign interface to cpu by 
> smp_affinity? (/proc/irq/NN/smp_affinity)

Yes, I did, didn't make any difference.

> And still i think best thing is oprofile. It can show "hot" places in code, 
> who is spending cpu cycles.

For some reason I get a hard freeze when I start the oprofile daemon, even
without traffic. I've never used oprofile before, so I'm not sure if I'm
doing something wrong ... I'm starting it just with the --vmlinux parameter
and nothing else. I use vanilla 2.6.29 and oprofile from FC8.

-- 
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-08 20:46                   ` Vladimir Ivashchenko
@ 2009-05-08 21:05                     ` Denys Fedoryschenko
  2009-05-08 22:07                       ` Vladimir Ivashchenko
  0 siblings, 1 reply; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-08 21:05 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

On Friday 08 May 2009 23:46:11 Vladimir Ivashchenko wrote:
> > > Without IRQ balancing, one of the cores goes around 10% idle and HTB
> > > doesn't do its job properly. Actually, in my experience HTB stops
> > > working properly after idle goes below 35%.
> >
> > It seems they should. No idea, more experienced guys should know more.
> >
> > Can you show me please
> > cat /proc/net/psched
> > If it is highres working, try to add in HTB script, first line
> >
> > HZ=1000
> > to set environment variable. Because if clock resolution high, burst
> > calculation going crazy on high speeds.
> > Maybe it will help.
>
> Wow, instead of 98425b burst, its calculating 970203b.
Kind of a strange burst, something is wrong there. For 1000HZ and 1 Gbit it should 
be 126375b. Your value is for 8 Gbit/s.
What version of iproute2 are you using (tc -V)?

>
> Exporting HZ=1000 doesn't help. However, even if I recompile the kernel
> to 1000 Hz and the burst is calculated correctly, for some reason HTB on
> 2.6.29 is still worse at rate control than 2.6.21.
>
> With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> With 2.6.29, same ceil/burst -> actual rate 890 mbits.
It also depends on whether there are child classes, and on what burst and 
ceil/burst are set for them.

>
> Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, actual rate
> reported by htb goes ballistic and stays at 1100mbits. Then it drops
> back to expected value after a minute or so.
It is the average bandwidth over some period, not a realtime value. 

>
> > Also without irq balance, did you try to assign interface to cpu by
> > smp_affinity? (/proc/irq/NN/smp_affinity)
>
> Yes, I did, didn't make any difference.
What is the clock source?
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Timer resolution?
cat /proc/net/psched

>
> > And still i think best thing is oprofile. It can show "hot" places in
> > code, who is spending cpu cycles.
>
> For some reason I get a hard freeze when I start oprofile daemon, even
> without traffic. Never used oprofile before, so I'm not sure if I'm
> doing something wrong ... I'm starting it just with --vmlinux parameter
> and nothing else. I use vanilla 2.6.29 and oprofile from FC8.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-08 21:05                     ` Denys Fedoryschenko
@ 2009-05-08 22:07                       ` Vladimir Ivashchenko
  2009-05-08 22:42                         ` Denys Fedoryschenko
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-08 22:07 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: netdev

> > Wow, instead of 98425b burst, its calculating 970203b.
> Kind of strange burst, something wrong there. For 1000HZ and 1 Gbit it should 
> be 126375b. You value is for 8Gbit/s.
> What version of iproute2 you are using ( tc -V )?

That was iproute2-ss080725; I think it is confused by tickless mode.
With iproute2-ss090324 I'm getting the opposite: 1589b :)

> >
> > With 2.6.21, ceil of 775 mbits, burst 99425b -> actual rate 825 mbits.
> > With 2.6.29, same ceil/burst -> actual rate 890 mbits.
> It depends also if there is child classes, what is bursts set for them, and 
> what is ceil/burst set for them.

All child classes have smaller bursts than the parent. However, there are two 
sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
don't know HTB internals, perhaps these two classes make the parent class 
overstretch itself.

By the way, I experience the same "overstretching" with hfsc. In any case, 
I prefer HTB because it reports statistics of parent classes, unlike hfsc.

> > Moreover, after I stop the traffic *COMPLETELY* on 2.6.29, actual rate
> > reported by htb goes ballistic and stays at 1100mbits. Then it drops
> > back to expected value after a minute or so.
> It is average bandwidth for some period, it is not realtime value. 

But why would it jump from 850mbits to 1200mbits *AFTER* I remove all
the traffic ?

> > Yes, I did, didn't make any difference.
> What is a clock source?
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource

tsc

> Timer resolution?
> cat /proc/net/psched

With tickless kernel:

000003e8 00000400 000f4240 3b9aca00

-- 
Best Regards
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel, Cyprus - www.prime-tel.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-08 22:07                       ` Vladimir Ivashchenko
@ 2009-05-08 22:42                         ` Denys Fedoryschenko
  2009-05-17 18:46                           ` Vladimir Ivashchenko
  0 siblings, 1 reply; 27+ messages in thread
From: Denys Fedoryschenko @ 2009-05-08 22:42 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: netdev

On Saturday 09 May 2009 01:07:27 Vladimir Ivashchenko wrote:
> > > Wow, instead of 98425b burst, its calculating 970203b.
> >
> > Kind of strange burst, something wrong there. For 1000HZ and 1 Gbit it
> > should be 126375b. You value is for 8Gbit/s.
> > What version of iproute2 you are using ( tc -V )?
>
> That was iproute2-ss080725, I think it is confused by tickless mode.
> With iproute2-ss090324 I'm getting an opposite: 1589b :)
And that is too low. That's why I set HZ=1000.
>
>
> All child classes have smaller bursts than the parent. However, there are
> two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
> don't know HTB internals, perhaps these two classes make the parent class
> overstretch itself.
As I remember, it is important to keep the sum of the child rates lower than or
equal to the parent rate. And of course the ceil of the children must not exceed
the ceil of the parent.
Sometimes I made a mess when I tried to play with the quantum value. After all that
I switched to HFSC, which works flawlessly for me. Maybe we should give more
attention to the HTB problem at high speeds and help the kernel developers spot
the problem, if there is any.
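
In other words, something like this (a sketch; the rates and class ids are
made up, only the relationship between them matters):

tc qdisc add dev bond0 root handle 1: htb
tc class add dev bond0 parent 1:  classid 1:2  htb rate 775mbit ceil 775mbit
tc class add dev bond0 parent 1:2 classid 1:10 htb rate 400mbit ceil 775mbit
tc class add dev bond0 parent 1:2 classid 1:20 htb rate 375mbit ceil 775mbit
# 400 + 375 <= 775: the child rates sum to no more than the parent rate,
# and no child ceil exceeds the parent ceil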

>
> By the way, I experience the same "overstretching" with hfsc. In any case,
> I prefer HTB because it reports statistics of parent classes, unlike hfsc.
Sometimes that happens when some offloading is enabled on the devices.
Check ethtool -k device

I think everything except rx/tx checksumming has to be off, at least for a 
test.

Disable them with "ethtool -K device tso off", for example.
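
For example (a sketch; assuming eth0-eth3 are the slaves of the two bonds):

for dev in eth0 eth1 eth2 eth3; do
    ethtool -k $dev                          # show the current offload settings
    ethtool -K $dev tso off gso off sg off   # leave rx/tx checksumming alone
done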


>
> But why it would it jump from 850mbits to 1200mbits *AFTER* I remove all
> the traffic ?
>
Well, I don't know how it does the averaging, maybe even over 1 minute.
I don't like it at all, and that's why I prefer HFSC. But HTB works very well in
some setups.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-08 22:42                         ` Denys Fedoryschenko
@ 2009-05-17 18:46                           ` Vladimir Ivashchenko
  2009-05-18  8:51                             ` Jarek Poplawski
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Ivashchenko @ 2009-05-17 18:46 UTC (permalink / raw)
  To: Denys Fedoryschenko; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 1948 bytes --]


> > All child classes have smaller bursts than the parent. However, there are
> > two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
> > don't know HTB internals, perhaps these two classes make the parent class
> > overstretch itself.
> As i remember important to keep sum of child rates lower or equal parent rate.
> Sure ceil of childs must not exceed ceil of parent.
> Sometimes i had mess, when i tried to play with quantum value. After all that 
> i switched to HFSC which works for me flawlessly. Maybe we should give more 
> attention to HTB problem with high speeds and help kernel developers spot 
> problem, if there is any.

In the case of HFSC my problem is even worse. With a 775mbit ceiling
configured it is passing over 900mbit in reality. Moreover, not having
statistics for parent classes makes it difficult to troubleshoot :( I'm
100% sure that it is 900 mbps, I see this on the switch.

Attached is "tc -s -d class show dev bond0" output.

To calculate total traffic rate:

$ cat hfsc-stat.txt | grep rate | grep Kbit | sed 's/Kbit//' | awk
'{ a=a+$2; } END { print a; }'
906955

Did I misconfigure something ?... How can hfsc go above 775mbit when
everything goes via class 1:2 with 775mbit rate & ul ?

> > By the way, I experience the same "overstretching" with hfsc. In any case,
> > I prefer HTB because it reports statistics of parent classes, unlike hfsc.
> Sometimes it happen when some offloading enabled on devices.
> Check ethtool -k device

Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off


> I think everything except rx/tx checksumming have to be off, at least for 
> test.
> 
> Disable them by "ethtool -K device tso off " for example.

Doesn't help.

-- 
Best Regards,
Vladimir Ivashchenko
Chief Technology Officer
PrimeTel PLC, Cyprus - www.prime-tel.com
Tel: +357 25 100100 Fax: +357 2210 2211


[-- Attachment #2: hfsc-stat.txt --]
[-- Type: text/plain, Size: 18482 bytes --]

class hfsc 1:99 parent 1: leaf 99: sc m1 0bit d 15.0ms m2 542500Kbit 
 Sent 20385133 bytes 192619 pkt (dropped 7070, overlimits 0 requeues 0) 
 rate 402936bit 432pps backlog 0b 0p requeues 0 
 period 192581 work 20385133 bytes rtwork 20353003 bytes level 0 

class hfsc 1:5005 parent 1:97 leaf 5005: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 2948666978 bytes 4359774 pkt (dropped 50623, overlimits 0 requeues 0) 
 rate 55946Kbit 10338pps backlog 0b 7p requeues 0 
 period 1246140 work 2945839727 bytes rtwork 332352472 bytes level 0 

class hfsc 1:98 parent 1:2 leaf 98: sc m1 0bit d 500.0ms m2 43400Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 4644738789 bytes 5853965 pkt (dropped 165294, overlimits 0 requeues 0) 
 rate 85749Kbit 13526pps backlog 0b 0p requeues 0 
 period 1968169 work 4633993957 bytes rtwork 2322556251 bytes level 0 

class hfsc 1:100 parent 1:2 sc m1 0bit d 5.0ms m2 15000Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 290192 work 76779316 bytes level 1 

class hfsc 1:10 parent 1:2 sc m1 0bit d 1.0ms m2 542500Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 464809 work 877481554 bytes level 1 

class hfsc 1:5004 parent 1:97 leaf 5004: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 3181480387 bytes 4586738 pkt (dropped 50789, overlimits 0 requeues 0) 
 rate 60049Kbit 10765pps backlog 0b 9p requeues 0 
 period 1258266 work 3178564479 bytes rtwork 332600104 bytes level 0 

class hfsc 1:5006 parent 1:97 leaf 5006: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 842333834 bytes 1339029 pkt (dropped 9194, overlimits 0 requeues 0) 
 rate 18154Kbit 3349pps backlog 0b 2p requeues 0 
 period 806244 work 841676259 bytes rtwork 332340905 bytes level 0 

class hfsc 1:3e9 parent 1:200 leaf 3e9: sc m1 0bit d 0us m2 256000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 23292 bytes 308 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 488bit 1pps backlog 0b 0p requeues 0 
 period 298 work 23292 bytes rtwork 21371 bytes level 0 

class hfsc 1:5001 parent 1:97 leaf 5001: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 2541073789 bytes 3768515 pkt (dropped 39084, overlimits 0 requeues 0) 
 rate 52514Kbit 9201pps backlog 0b 4p requeues 0 
 period 1221187 work 2538919598 bytes rtwork 332543703 bytes level 0 

class hfsc 1:3e8 parent 1:200 leaf 3e8: sc m1 0bit d 0us m2 512000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 2140 bytes 34 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 32bit 0pps backlog 0b 0p requeues 0 
 period 34 work 2140 bytes rtwork 2140 bytes level 0 

class hfsc 1:3eb parent 1:300 leaf 3eb: sc m1 0bit d 0us m2 64000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 206 bytes 3 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 3 work 206 bytes rtwork 206 bytes level 0 

class hfsc 1:5003 parent 1:97 leaf 5003: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 2245687618 bytes 3498522 pkt (dropped 35135, overlimits 0 requeues 0) 
 rate 42763Kbit 8415pps backlog 0b 5p requeues 0 
 period 1192583 work 2244028709 bytes rtwork 332304111 bytes level 0 

class hfsc 1:3ea parent 1:300 leaf 3ea: sc m1 0bit d 0us m2 1024Kbit ul m1 0bit d 0us m2 100000Kbit 
 Sent 54214 bytes 442 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 16bit 0pps backlog 0b 0p requeues 0 
 period 413 work 54214 bytes rtwork 48427 bytes level 0 

class hfsc 1:5002 parent 1:97 leaf 5002: sc m1 0bit d 200.0ms m2 6103Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 2915382517 bytes 4002191 pkt (dropped 35677, overlimits 0 requeues 0) 
 rate 54029Kbit 9447pps backlog 0b 7p requeues 0 
 period 1219564 work 2912952580 bytes rtwork 332338435 bytes level 0 

class hfsc 1:97 parent 1:2 sc m1 0bit d 200.0ms m2 97650Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 1853176 work 14661981352 bytes level 1 

class hfsc 1: root 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 0 level 3 

class hfsc 1:4004 parent 1:79 leaf 4004: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 4401715817 bytes 4690294 pkt (dropped 32975, overlimits 0 requeues 0) 
 rate 84643Kbit 11103pps backlog 0b 0p requeues 0 
 period 3255986 work 4402143323 bytes rtwork 1368797472 bytes level 0 

class hfsc 1:4005 parent 1:79 leaf 4005: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 4858823203 bytes 5510336 pkt (dropped 45336, overlimits 0 requeues 0) 
 rate 90141Kbit 12773pps backlog 0b 0p requeues 0 
 period 3648819 work 4859323215 bytes rtwork 1370067779 bytes level 0 

class hfsc 1:2 parent 1: sc m1 0bit d 0us m2 775000Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 6902 work 48071507639 bytes level 2 

class hfsc 1:4006 parent 1:79 leaf 4006: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 2215296226 bytes 2405111 pkt (dropped 16771, overlimits 0 requeues 0) 
 rate 41477Kbit 5802pps backlog 0b 0p requeues 0 
 period 2036400 work 2217428341 bytes rtwork 1369522295 bytes level 0 

class hfsc 1:c9 parent 1:200 leaf c9: sc m1 0bit d 0us m2 32000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 240026752 bytes 543130 pkt (dropped 7429, overlimits 0 requeues 0) 
 rate 4201Kbit 1193pps backlog 0b 0p requeues 0 
 period 307816 work 240015088 bytes rtwork 1763247 bytes level 0 

class hfsc 1:4001 parent 1:79 leaf 4001: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 4432353076 bytes 4969719 pkt (dropped 31037, overlimits 0 requeues 0) 
 rate 79306Kbit 10880pps backlog 0b 0p requeues 0 
 period 3452541 work 4431357910 bytes rtwork 1370028629 bytes level 0 

class hfsc 1:4002 parent 1:79 leaf 4002: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 4446019233 bytes 4948736 pkt (dropped 42181, overlimits 0 requeues 0) 
 rate 87046Kbit 11613pps backlog 0b 0p requeues 0 
 period 3427533 work 4448206779 bytes rtwork 1370171471 bytes level 0 

class hfsc 1:4003 parent 1:79 leaf 4003: sc m1 0bit d 50.0ms m2 25090Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 3759018918 bytes 4336293 pkt (dropped 23579, overlimits 0 requeues 0) 
 rate 75258Kbit 10430pps backlog 0b 0p requeues 0 
 period 3147915 work 3758684585 bytes rtwork 1368528984 bytes level 0 

class hfsc 1:80 parent 1:2 leaf 80: sc m1 0bit d 50.0ms m2 1000bit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 0 level 0 

class hfsc 1:b parent 1:10 leaf b: sc m1 0bit d 0us m2 32000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 878491329 bytes 1005975 pkt (dropped 15849, overlimits 0 requeues 0) 
 rate 14093Kbit 2159pps backlog 0b 0p requeues 0 
 period 464809 work 877481554 bytes rtwork 1891407 bytes level 0 

class hfsc 1:300 parent 1:2 sc m1 0bit d 15.0ms m2 30000Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 742649 work 1854498600 bytes level 1 

class hfsc 1:3999 parent 1:79 leaf 3999: sc m1 0bit d 0us m2 1000bit ul m1 0bit d 0us m2 30432Kbit 
 Sent 1617385819 bytes 1260031 pkt (dropped 110848, overlimits 0 requeues 0) 
 rate 30581Kbit 2977pps backlog 0b 56p requeues 0 
 period 1381 work 1609588187 bytes rtwork 58059 bytes level 0 

class hfsc 1:12d parent 1:300 leaf 12d: sc m1 0bit d 0us m2 32000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 1860128386 bytes 2152291 pkt (dropped 51268, overlimits 0 requeues 0) 
 rate 29842Kbit 4647pps backlog 0b 2p requeues 0 
 period 742499 work 1854444180 bytes rtwork 1775294 bytes level 0 

class hfsc 1:79 parent 1:2 sc m1 0bit d 0us m2 401450Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 9008 work 25726746683 bytes level 1 

class hfsc 1:2000 parent 1:79 sc m1 0bit d 0us m2 1000bit ul m1 0bit d 0us m2 401450Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 0 level 0 

class hfsc 1:200 parent 1:2 sc m1 0bit d 10.0ms m2 35000Kbit ul m1 0bit d 0us m2 542500Kbit 
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
 rate 0bit 0pps backlog 0b 0p requeues 0 
 period 308015 work 240040520 bytes level 1 

class hfsc 1:65 parent 1:100 leaf 65: sc m1 0bit d 0us m2 32000bit ul m1 0bit d 0us m2 100000Kbit 
 Sent 76808971 bytes 457236 pkt (dropped 16755, overlimits 0 requeues 0) 
 rate 1163Kbit 989pps backlog 0b 2p requeues 0 
 period 290193 work 76779316 bytes rtwork 1743830 bytes level 0 

class prio 99:1 parent 99: 
 Sent 1543 bytes 11 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
class prio 99:2 parent 99: 
 Sent 19693553 bytes 185805 pkt (dropped 6999, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
class prio 99:3 parent 99: 
 Sent 690143 bytes 6804 pkt (dropped 71, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
class sfq 3999:1a parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 5942 

class sfq 3999:2d parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -1279 

class sfq 3999:3e parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3072 

class sfq 3999:46 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -1035 

class sfq 3999:ae parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot -871 

class sfq 3999:c4 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 1429 

class sfq 3999:e3 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 388 

class sfq 3999:ee parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -582 

class sfq 3999:155 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot -1409 

class sfq 3999:17c parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot -128 

class sfq 3999:1c2 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 9048 

class sfq 3999:1ca parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 6p requeues 0 
 allot -1156 

class sfq 3999:217 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -90 

class sfq 3999:229 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 336 

class sfq 3999:234 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 4p requeues 0 
 allot -1248 

class sfq 3999:236 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -350 

class sfq 3999:24d parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 2602 

class sfq 3999:259 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot 12595 

class sfq 3999:27e parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 15861 

class sfq 3999:28e parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 578 

class sfq 3999:2af parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 6p requeues 0 
 allot -280 

class sfq 3999:32b parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 109 

class sfq 3999:37e parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 6p requeues 0 
 allot -1355 

class sfq 3999:3ed parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot -1080 

class sfq 3999:400 parent 3999: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot -293 

class sfq 98:9 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 7590 

class sfq 98:88 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 10598 

class sfq 98:10c parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 16677 

class sfq 98:111 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 9084 

class sfq 98:22f parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3028 

class sfq 98:294 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 12054 

class sfq 98:2f9 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 23634 

class sfq 98:322 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot -28605 

class sfq 98:364 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 6056 

class sfq 98:3b3 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 4546 

class sfq 98:3f9 parent 98: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 30100 

class sfq 5001:39a parent 5001: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 24540 

class sfq 4002:226 parent 4002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 4542 

class sfq 5002:49 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 9084 

class sfq 5002:a7 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 9084 

class sfq 5002:138 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 7570 

class sfq 5002:167 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 4542 

class sfq 5002:19c parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 10606 

class sfq 5002:20b parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 2974 

class sfq 5002:22a parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 12112 

class sfq 5002:256 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 25094 

class sfq 5002:289 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 9084 

class sfq 5002:2f4 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3028 

class sfq 5002:341 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 7570 

class sfq 5002:351 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 8974 

class sfq 5002:376 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 6056 

class sfq 5002:3a6 parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3028 

class sfq 5002:3bc parent 5002: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 6064 

class sfq 5003:68 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot 4542 

class sfq 5003:a7 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot -24050 

class sfq 5003:14d parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 2056 

class sfq 5003:1f3 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 6056 

class sfq 5003:27f parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3036 

class sfq 5003:2e8 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 28716 

class sfq 5003:311 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 13572 

class sfq 5003:31a parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 13626 

class sfq 5003:361 parent 5003: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3048 

class sfq 5004:1e parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 10932 

class sfq 5004:101 parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 4542 

class sfq 5004:12e parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 9084 

class sfq 5004:156 parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 19526 

class sfq 5004:179 parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 671163409p requeues 0 
 allot 724 

class sfq 5004:1a0 parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 19682 

class sfq 5004:34c parent 5004: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot -30965 

class sfq 4005:134 parent 4005: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 

class sfq 12d:47 parent 12d: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 4542 

class sfq 12d:a8 parent 12d: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 3p requeues 0 
 allot 1514 

class sfq 12d:ef parent 12d: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 

class sfq 12d:368 parent 12d: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 3268 

class sfq 65:4e parent 65: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 2p requeues 0 
 allot 1514 

class sfq 65:91 parent 65: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 

class sfq 65:286 parent 65: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 10530 

class sfq c9:357 parent c9: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 

class sfq b:72 parent b: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 

class sfq b:1ab parent b: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 13560 

class sfq b:20b parent b: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 13572 

class sfq b:38e parent b: 
 (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 1p requeues 0 
 allot 1514 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: bond + tc regression ?
  2009-05-17 18:46                           ` Vladimir Ivashchenko
@ 2009-05-18  8:51                             ` Jarek Poplawski
  0 siblings, 0 replies; 27+ messages in thread
From: Jarek Poplawski @ 2009-05-18  8:51 UTC (permalink / raw)
  To: Vladimir Ivashchenko; +Cc: Denys Fedoryschenko, netdev

On 17-05-2009 20:46, Vladimir Ivashchenko wrote:
>>> All child classes have smaller bursts than the parent. However, there are
>>> two sub-classes which have ceil at 70% of parent, e.g. ~500mbit each. I
>>> don't know HTB internals, perhaps these two classes make the parent class
>>> overstretch itself.
>> As i remember important to keep sum of child rates lower or equal parent rate.
>> Sure ceil of childs must not exceed ceil of parent.
>> Sometimes i had mess, when i tried to play with quantum value. After all that 
>> i switched to HFSC which works for me flawlessly. Maybe we should give more 
>> attention to HTB problem with high speeds and help kernel developers spot 
>> problem, if there is any.
> 
> In case of HFSC my problem is even worse. With 775mbit ceiling
> configured it is passing over 900mbit in reality. Moreover not having
> statistics for parent classes makes it difficult to troubleshoot :( I'm
> 100% sure that it is 900 mbps, I see this on the switch.
> 
> Attached is "tc -s -d class show dev bond0" output.
> 
> To calculate total traffic rate:
> 
> $ cat hfsc-stat.txt | grep rate | grep Kbit | sed 's/Kbit//' | awk
> '{ a=a+$2; } END { print a; }'
> 906955
> 
> Did I misconfigure something ?... How can hfsc go above 775mbit when
> everything goes via class 1:2 with 775mbit rate & ul ?

Maybe... It's a lot to check - it seems simpler test cases would show the
real problem better. Anyway, it looks like the sum of the m2 values of the
1:2 children is more than 775Mbit.
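
A rough cross-check against the attached hfsc-stat.txt (a sketch: it only sums
the Kbit-valued m2 of the first, sc, curve of each direct child of 1:2):

awk '/^class hfsc/ && / parent 1:2 / {
        for (i = 1; i <= NF; i++)
                if ($i == "m2") {
                        if ($(i+1) ~ /Kbit/) { sub("Kbit", "", $(i+1)); sum += $(i+1) }
                        break
                }
} END { print sum, "Kbit" }' hfsc-stat.txt

Here that comes to roughly 1165000 Kbit, i.e. about 1.16 Gbit of guaranteed
service under a 775Mbit parent.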


>>> By the way, I experience the same "overstretching" with hfsc. In any case,
>>> I prefer HTB because it reports statistics of parent classes, unlike hfsc.
>> Sometimes it happen when some offloading enabled on devices.
>> Check ethtool -k device
> 
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp segmentation offload: off
> udp fragmentation offload: off

Current versions of ethtool should show "generic segmentation offload"
too.

I hope you've read the nearby thread "HTB accuracy for high speed",
which explains at least partially some problems/bugs, and maybe you'll
try some patches too (at least one of them addresses the problem you've
reported). Anyway, if you don't find that hfsc is better for you, I'd be
more interested in tracking this down on htb test cases.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-05-18  8:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-05 15:45 bond + tc regression ? Vladimir Ivashchenko
2009-05-05 16:25 ` Denys Fedoryschenko
2009-05-05 16:31 ` Eric Dumazet
2009-05-05 17:41   ` Vladimir Ivashchenko
2009-05-05 18:50     ` Eric Dumazet
2009-05-05 23:50       ` Vladimir Ivashchenko
2009-05-05 23:52         ` Stephen Hemminger
2009-05-06  3:36         ` Eric Dumazet
2009-05-06 10:28           ` Vladimir Ivashchenko
2009-05-06 10:41             ` Eric Dumazet
2009-05-06 10:49               ` Denys Fedoryschenko
2009-05-06 18:45           ` Vladimir Ivashchenko
2009-05-06 19:30             ` Denys Fedoryschenko
2009-05-06 20:47               ` Vladimir Ivashchenko
2009-05-06 21:46                 ` Denys Fedoryschenko
2009-05-08 20:46                   ` Vladimir Ivashchenko
2009-05-08 21:05                     ` Denys Fedoryschenko
2009-05-08 22:07                       ` Vladimir Ivashchenko
2009-05-08 22:42                         ` Denys Fedoryschenko
2009-05-17 18:46                           ` Vladimir Ivashchenko
2009-05-18  8:51                             ` Jarek Poplawski
2009-05-06  8:03       ` Ingo Molnar
2009-05-06  6:10     ` Jarek Poplawski
2009-05-06 10:36       ` Vladimir Ivashchenko
2009-05-06 10:48         ` Jarek Poplawski
2009-05-06 13:11           ` Vladimir Ivashchenko
2009-05-06 13:31             ` Patrick McHardy
