* PROBLEM: IProute hangs after running traffic shaping scripts
@ 2004-11-05  9:48 Szymon Miotk
  2004-11-05 11:54 ` Thomas Graf
  2004-11-07 22:22 ` PROBLEM: IProute hangs after running traffic shaping scripts Patrick McHardy
  0 siblings, 2 replies; 34+ messages in thread
From: Szymon Miotk @ 2004-11-05  9:48 UTC (permalink / raw)
  To: netdev; +Cc: kuznet, jmorris

This mail was posted on 02-11-2004 to linux-net@vger.kernel.org, but I got
no response at all, so I am resending it.

[1.] One line summary of the problem:
IProute hangs after running traffic shaping scripts

[2.] Full description of the problem/report:
I have a server with 3 links to ISPs and 1 link to the internal network.
I shape my clients to certain speeds, depending on the time of day.
I have HTB shaping on each interface, with about 2250 classes and 2250
qdiscs on each, for a total of ~9000 classes (HTB) and ~9000 qdiscs (SFQ).
I run the shaping scripts 4 times/day.
Sometimes this produces a kernel oops and hangs at some 'tc ...' command
(which command differs from run to run).
Afterwards the shaping works so-so (it usually works, but doesn't fully
utilize the bandwidth) and every iproute command hangs.
Killing the hanging processes succeeds, but every iproute command still
hangs, including ip and tc.
Sometimes the server stops forwarding, but usually only a few hours
after the kernel oops.
A reboot always helps.
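
For context, each class/qdisc pair in such a setup is typically built with
commands of roughly this shape. This is a minimal sketch only: the interface
name, class IDs, rates and the client IP below are made up, not taken from
my actual scripts, which are far larger.

```shell
#!/bin/sh
# Hypothetical single-client excerpt; repeated ~2250 times per interface.
DEV=eth1    # assumed interface name

# Root HTB qdisc on the interface.
tc qdisc add dev "$DEV" root handle 1: htb

# One HTB class per client (rate/ceil values are illustrative).
tc class add dev "$DEV" parent 1: classid 1:10 htb rate 512kbit ceil 1mbit

# One SFQ qdisc attached under that class.
tc qdisc add dev "$DEV" parent 1:10 handle 10: sfq perturb 10

# A u32 filter directing that client's traffic into the class.
tc filter add dev "$DEV" parent 1: protocol ip prio 1 \
   u32 match ip dst 10.0.0.10/32 flowid 1:10
```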

[3.] Keywords (i.e., modules, networking, kernel):
traffic shaping, htb, qdisc, networking, kernel

[4.] Kernel version (from /proc/version):
Linux version 2.6.9 (root@ducttape.mlyniec) (gcc version 3.3.3 20040412
(Red Hat Linux 3.3.3-7)) #2 Thu Oct 28 17:06:01 CEST 2004

[5.] Output of Oops.. message (if applicable) with symbolic information
      resolved (see Documentation/oops-tracing.txt)
ksymoops 2.4.9 on i686 2.6.5-1.358smp.  Options used
      -v /usr/src/linux-2.6.9/vmlinux (specified)
      -K (specified)
      -L (specified)
      -O (specified)
      -m /usr/src/linux-2.6.9/System.map (specified)

Oct 31 15:02:38 cerber kernel: Unable to handle kernel paging request at
virtual address 00100100
Oct 31 15:02:38 cerber kernel: *pde = 00000000
Oct 31 15:02:38 cerber kernel: Oops: 0000 [#1]
Oct 31 15:02:38 cerber kernel: c03728c8
Oct 31 15:02:38 cerber kernel: CPU:    0
Oct 31 15:02:38 cerber kernel: EIP:    0060:[<c03728c8>]    Not tainted VLI
Using defaults from ksymoops -t elf32-i386 -a i386
Oct 31 15:02:38 cerber kernel: EFLAGS: 00010286   (2.6.9)
Oct 31 15:02:38 cerber kernel: eax: 001000b8   ebx: 001000b8   ecx:
f0166048   edx: 00100100
Oct 31 15:02:38 cerber kernel: esi: f7c1b12c   edi: 00010000   ebp:
f7c1b000   esp: f1b47c54
Oct 31 15:02:38 cerber kernel: ds: 007b   es: 007b   ss: 0068
Oct 31 15:02:38 cerber kernel: Stack: ef3aaa10 eedd6000 c037325c
c0373353 00000000 00000000 00000000 00000000
Oct 31 15:02:38 cerber kernel:        00000000 000008dc 00000002
00000000 f23de000 c01044e5 f1b47c94 00000002
Oct 31 15:02:38 cerber kernel:        00000000 ffffffff c18fcf80
ef3aaa00 f5019680 f23de000 00000000 f5019680
Oct 31 15:02:38 cerber kernel: Call Trace:
Oct 31 15:02:38 cerber kernel:  [<c037325c>] tc_modify_qdisc+0x0/0x6e3
Oct 31 15:02:38 cerber kernel:  [<c0373353>] tc_modify_qdisc+0xf7/0x6e3
Oct 31 15:02:38 cerber kernel:  [<c01044e5>] error_code+0x2d/0x38
Oct 31 15:02:38 cerber kernel:  [<c037325c>] tc_modify_qdisc+0x0/0x6e3
Oct 31 15:02:38 cerber kernel:  [<c036bb36>] rtnetlink_rcv+0x2af/0x359
Oct 31 15:02:38 cerber kernel:  [<c036b887>] rtnetlink_rcv+0x0/0x359
Oct 31 15:02:38 cerber kernel:  [<c038add3>] netlink_data_ready+0x55/0x5d
Oct 31 15:02:38 cerber kernel:  [<c038a49f>] netlink_sendskb+0x8a/0x8c
Oct 31 15:02:38 cerber kernel:  [<c038aac5>] netlink_sendmsg+0x1d7/0x2af
Oct 31 15:02:38 cerber kernel:  [<c03595c9>] sock_sendmsg+0xcc/0xe6
Oct 31 15:02:38 cerber kernel:  [<c03596fb>] sock_recvmsg+0xdc/0xf7
Oct 31 15:02:38 cerber kernel:  [<c0140657>] buffered_rmqueue+0xcf/0x27a
Oct 31 15:02:38 cerber kernel:  [<c039333b>] ip_forward_finish+0x26/0x4b
Oct 31 15:02:38 cerber kernel:  [<c011a0f2>]
autoremove_wake_function+0x0/0x43
Oct 31 15:02:38 cerber kernel:  [<c021586d>] copy_from_user+0x54/0x83
Oct 31 15:02:38 cerber kernel:  [<c035f4fa>] verify_iovec+0x2a/0x74
Oct 31 15:02:38 cerber kernel:  [<c035ab74>] sys_sendmsg+0x14c/0x197
Oct 31 15:02:38 cerber kernel:  [<c015001e>] handle_mm_fault+0x11d/0x2cf
Oct 31 15:02:38 cerber kernel:  [<c03593fb>] sockfd_lookup+0x16/0x6e
Oct 31 15:02:38 cerber kernel:  [<c035a921>] sys_setsockopt+0x69/0x9e
Oct 31 15:02:38 cerber kernel:  [<c035af9f>] sys_socketcall+0x22b/0x249
Oct 31 15:02:38 cerber kernel:  [<c0112ca2>] do_page_fault+0x0/0x52b
Oct 31 15:02:38 cerber kernel:  [<c01044e5>] error_code+0x2d/0x38
Oct 31 15:02:38 cerber kernel:  [<c01042e9>] sysenter_past_esp+0x52/0x71
Oct 31 15:02:38 cerber kernel: Code: 89 d7 56 53 8b 88 2c 01 00 00 8d 59
b8 8b 53 48 0f 18 02 90 8d b0 2c 01 00 00 39 f1 74 18 39 7b 14 74 19 8b
53 48 8d 42 b8 89 c3 <8b> 40 48 0f 18 00 90 39 f2 75 e8 31 c0 5b 5e 5f
c3 89 d8 eb f8

>>EIP; c03728c8 <qdisc_lookup+2c/41>   <=====


>>ecx; f0166048 <pg0+2faf4048/3f98c400>
>>esi; f7c1b12c <pg0+375a912c/3f98c400>
>>ebp; f7c1b000 <pg0+375a9000/3f98c400>
>>esp; f1b47c54 <pg0+314d5c54/3f98c400>


Trace; c037325c <tc_modify_qdisc+0/6e3>
Trace; c0373353 <tc_modify_qdisc+f7/6e3>
Trace; c01044e5 <error_code+2d/38>
Trace; c037325c <tc_modify_qdisc+0/6e3>
Trace; c036bb36 <rtnetlink_rcv+2af/359>
Trace; c036b887 <rtnetlink_rcv+0/359>
Trace; c038add3 <netlink_data_ready+55/5d>
Trace; c038a49f <netlink_sendskb+8a/8c>
Trace; c038aac5 <netlink_sendmsg+1d7/2af>
Trace; c03595c9 <sock_sendmsg+cc/e6>
Trace; c03596fb <sock_recvmsg+dc/f7>
Trace; c0140657 <buffered_rmqueue+cf/27a>
Trace; c039333b <ip_forward_finish+26/4b>
Trace; c011a0f2 <autoremove_wake_function+0/43>
Trace; c021586d <copy_from_user+54/83>
Trace; c035f4fa <verify_iovec+2a/74>
Trace; c035ab74 <sys_sendmsg+14c/197>
Trace; c015001e <handle_mm_fault+11d/2cf>
Trace; c03593fb <sockfd_lookup+16/6e>
Trace; c035a921 <sys_setsockopt+69/9e>
Trace; c035af9f <sys_socketcall+22b/249>
Trace; c0112ca2 <do_page_fault+0/52b>
Trace; c01044e5 <error_code+2d/38>
Trace; c01042e9 <sysenter_past_esp+52/71>

This architecture has variable length instructions, decoding before eip
is unreliable, take these instructions with a pinch of salt.


Code;  c037289d <qdisc_lookup+1/41>
00000000 <_EIP>:
Code;  c037289d <qdisc_lookup+1/41>
    0:   89 d7                     mov    %edx,%edi
Code;  c037289f <qdisc_lookup+3/41>
    2:   56                        push   %esi
Code;  c03728a0 <qdisc_lookup+4/41>
    3:   53                        push   %ebx
Code;  c03728a1 <qdisc_lookup+5/41>
    4:   8b 88 2c 01 00 00         mov    0x12c(%eax),%ecx
Code;  c03728a7 <qdisc_lookup+b/41>
    a:   8d 59 b8                  lea    0xffffffb8(%ecx),%ebx
Code;  c03728aa <qdisc_lookup+e/41>
    d:   8b 53 48                  mov    0x48(%ebx),%edx
Code;  c03728ad <qdisc_lookup+11/41>
   10:   0f 18 02                  prefetchnta (%edx)
Code;  c03728b0 <qdisc_lookup+14/41>
   13:   90                        nop
Code;  c03728b1 <qdisc_lookup+15/41>
   14:   8d b0 2c 01 00 00         lea    0x12c(%eax),%esi
Code;  c03728b7 <qdisc_lookup+1b/41>
   1a:   39 f1                     cmp    %esi,%ecx
Code;  c03728b9 <qdisc_lookup+1d/41>
   1c:   74 18                     je     36 <_EIP+0x36>
Code;  c03728bb <qdisc_lookup+1f/41>
   1e:   39 7b 14                  cmp    %edi,0x14(%ebx)
Code;  c03728be <qdisc_lookup+22/41>
   21:   74 19                     je     3c <_EIP+0x3c>
Code;  c03728c0 <qdisc_lookup+24/41>
   23:   8b 53 48                  mov    0x48(%ebx),%edx
Code;  c03728c3 <qdisc_lookup+27/41>
   26:   8d 42 b8                  lea    0xffffffb8(%edx),%eax
Code;  c03728c6 <qdisc_lookup+2a/41>
   29:   89 c3                     mov    %eax,%ebx

This decode from eip onwards should be reliable


Code;  c03728c8 <qdisc_lookup+2c/41>
00000000 <_EIP>:
Code;  c03728c8 <qdisc_lookup+2c/41>   <=====
    0:   8b 40 48                  mov    0x48(%eax),%eax   <=====
Code;  c03728cb <qdisc_lookup+2f/41>
    3:   0f 18 00                  prefetchnta (%eax)
Code;  c03728ce <qdisc_lookup+32/41>
    6:   90                        nop
Code;  c03728cf <qdisc_lookup+33/41>
    7:   39 f2                     cmp    %esi,%edx
Code;  c03728d1 <qdisc_lookup+35/41>
    9:   75 e8                     jne    fffffff3 <_EIP+0xfffffff3>
Code;  c03728d3 <qdisc_lookup+37/41>
    b:   31 c0                     xor    %eax,%eax
Code;  c03728d5 <qdisc_lookup+39/41>
    d:   5b                        pop    %ebx
Code;  c03728d6 <qdisc_lookup+3a/41>
    e:   5e                        pop    %esi
Code;  c03728d7 <qdisc_lookup+3b/41>
    f:   5f                        pop    %edi
Code;  c03728d8 <qdisc_lookup+3c/41>
   10:   c3                        ret
Code;  c03728d9 <qdisc_lookup+3d/41>
   11:   89 d8                     mov    %ebx,%eax
Code;  c03728db <qdisc_lookup+3f/41>
   13:   eb f8                     jmp    d <_EIP+0xd>

[6.] A small shell script or example program which triggers the
      problem (if possible)
My traffic shaping scripts are rather huge and they don't always cause a
kernel oops. I tried running them together (so the classes and qdiscs on
every interface were changed in parallel), but it didn't help.
I can send them to you if you wish.

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)
This is the output from the machine where the kernel was compiled. It has
the same hardware setup and the same Linux distro, but with development
packages installed.
Gnu C                  3.3.3
Gnu make               3.80
binutils               2.15.90.0.3
util-linux             2.12
mount                  2.12
module-init-tools      2.4.26
e2fsprogs              1.35
reiserfsprogs          line
reiser4progs           line
pcmcia-cs              3.2.7
quota-tools            3.10.
PPP                    2.4.2
isdn4k-utils           3.3
nfs-utils              1.0.6
Linux C Library        2.3.3
Dynamic linker (ldd)   2.3.3
Procps                 3.2.0
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.2.1

[7.2.] Processor information (from /proc/cpuinfo):
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping        : 5
cpu MHz         : 2814.286
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5554.17

This is HT P4, but SMP was disabled.

[7.3.] Module information (from /proc/modules):
This is static kernel, no modules.

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
cat /proc/ioports
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : libata
02f8-02ff : serial
0376-0376 : ide1
03c0-03df : vga+
03f8-03ff : serial
0cf8-0cff : PCI conf1
1000-107f : 0000:00:1f.0
1080-10bf : 0000:00:1f.0
1400-141f : 0000:00:1f.3
a000-a03f : 0000:02:00.0
   a000-a03f : e1000
a400-a43f : 0000:02:01.0
   a400-a43f : e1000
a800-a83f : 0000:02:02.0
   a800-a83f : e1000
ac00-ac3f : 0000:02:03.0
   ac00-ac3f : e1000
f000-f00f : 0000:00:1f.2
   f000-f00f : libata

  cat /proc/iomem
00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000ccbff : Video ROM
000d0000-000d0fff : Adapter ROM
000d1000-000d1fff : Adapter ROM
000d2000-000d2fff : Adapter ROM
000d3000-000d3fff : Adapter ROM
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
   00100000-004215e9 : Kernel code
   004215ea-0057e3ff : Kernel data
3fff0000-3fff2fff : ACPI Non-volatile Storage
3fff3000-3fffffff : ACPI Tables
e8000000-efffffff : 0000:00:00.0
f0000000-f7ffffff : PCI Bus #01
   f0000000-f7ffffff : 0000:01:00.0
f8000000-f9ffffff : PCI Bus #01
   f8000000-f8ffffff : 0000:01:00.0
fb000000-fb01ffff : 0000:02:00.0
   fb000000-fb01ffff : e1000
fb020000-fb03ffff : 0000:02:00.0
   fb020000-fb03ffff : e1000
fb040000-fb05ffff : 0000:02:01.0
   fb040000-fb05ffff : e1000
fb060000-fb07ffff : 0000:02:01.0
   fb060000-fb07ffff : e1000
fb080000-fb09ffff : 0000:02:02.0
   fb080000-fb09ffff : e1000
fb0a0000-fb0bffff : 0000:02:02.0
   fb0a0000-fb0bffff : e1000
fb0c0000-fb0dffff : 0000:02:03.0
   fb0c0000-fb0dffff : e1000
fb0e0000-fb0fffff : 0000:02:03.0
   fb0e0000-fb0fffff : e1000
fec00000-ffffffff : reserved

[7.5.] PCI information ('lspci -vvv' as root)
In short: 4 x Intel PRO/1000 MT cards; the integrated Marvell/Yukon GbE is disabled.


00:00.0 Host bridge: Intel Corp. 82865G/PE/P DRAM Controller/Host-Hub
Interface (rev 02)
         Subsystem: Giga-byte Technology GA-8IPE1000 Pro2 motherboard
(865PE)
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ >SERR- <PERR-
         Latency: 0
         Region 0: Memory at e8000000 (32-bit, prefetchable)
         Capabilities: [e4] #09 [2106]
         Capabilities: [a0] AGP version 3.0
                 Status: RQ=32 Iso- ArqSz=2 Cal=0 SBA+ ITACoh- GART64-
HTrans- 64bit- FW+ AGP3- Rate=x1,x2,x4
                 Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit-
FW- Rate=<none>

00:01.0 PCI bridge: Intel Corp. 82865G/PE/P PCI to AGP Controller (rev
02) (prog-if 00 [Normal decode])
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
         Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
         Latency: 64
         Bus: primary=00, secondary=01, subordinate=01, sec-latency=32
         I/O behind bridge: 0000f000-00000fff
         Memory behind bridge: f8000000-f9ffffff
         Prefetchable memory behind bridge: f0000000-f7ffffff
         BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B-

00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB/EB/ER Hub interface to PCI
Bridge (rev c2) (prog-if 00 [Normal decode])
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
         Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
         Latency: 0
         Bus: primary=00, secondary=02, subordinate=02, sec-latency=32
         I/O behind bridge: 0000a000-0000afff
         Memory behind bridge: fa000000-fbffffff
         Prefetchable memory behind bridge: fff00000-000fffff
         BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-

00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Bridge (rev 02)
         Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 0

00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage
Controller (rev 02) (prog-if 8a [Master SecP PriP])
         Subsystem: Giga-byte Technology GA-8IPE1000 Pro2 motherboard
(865PE)
         Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 0
         Interrupt: pin A routed to IRQ 11
         Region 0: I/O ports at <unassigned>
         Region 1: I/O ports at <unassigned>
         Region 2: I/O ports at <unassigned>
         Region 3: I/O ports at <unassigned>
         Region 4: I/O ports at f000 [size=16]

00:1f.3 SMBus: Intel Corp. 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev 02)
         Subsystem: Giga-byte Technology GA-8IPE1000 Pro2 motherboard
(865PE)
         Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Interrupt: pin B routed to IRQ 9
         Region 4: I/O ports at 1400 [size=32]

01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2
MX/MX 400] (rev b2) (prog-if 00 [VGA])
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 32 (1250ns min, 250ns max)
         Interrupt: pin A routed to IRQ 5
         Region 0: Memory at f8000000 (32-bit, non-prefetchable)
         Region 1: Memory at f0000000 (32-bit, prefetchable) [size=128M]
         Capabilities: [60] Power Management version 2
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [44] AGP version 2.0
                 Status: RQ=32 Iso- ArqSz=0 Cal=0 SBA- ITACoh- GART64-
HTrans- 64bit- FW- AGP3- Rate=x1,x2,x4
                 Command: RQ=1 ArqSz=0 Cal=0 SBA- AGP- GART64- 64bit-
FW- Rate=<none>

02:00.0 Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet
Controller
         Subsystem: Intel Corp. PRO/1000 MT Desktop Adapter
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 32 (63750ns min), Cache Line Size 08
         Interrupt: pin A routed to IRQ 11
         Region 0: Memory at fb020000 (32-bit, non-prefetchable)
         Region 1: Memory at fb000000 (32-bit, non-prefetchable) [size=128K]
         Region 2: I/O ports at a000 [size=64]
         Capabilities: [dc] Power Management version 2
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [e4] PCI-X non-bridge device.
                 Command: DPERE- ERO+ RBC=0 OST=0
                 Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-
         Capabilities: [f0] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                 Address: 0000000000000000  Data: 0000

02:01.0 Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet
Controller
         Subsystem: Intel Corp. PRO/1000 MT Desktop Adapter
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 32 (63750ns min), Cache Line Size 08
         Interrupt: pin A routed to IRQ 12
         Region 0: Memory at fb040000 (32-bit, non-prefetchable)
         Region 1: Memory at fb060000 (32-bit, non-prefetchable) [size=128K]
         Region 2: I/O ports at a400 [size=64]
         Capabilities: [dc] Power Management version 2
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [e4] PCI-X non-bridge device.
                 Command: DPERE- ERO+ RBC=0 OST=0
                 Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-
         Capabilities: [f0] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                 Address: 0000000000000000  Data: 0000

02:02.0 Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet
Controller
         Subsystem: Intel Corp. PRO/1000 MT Desktop Adapter
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 32 (63750ns min), Cache Line Size 08
         Interrupt: pin A routed to IRQ 10
         Region 0: Memory at fb080000 (32-bit, non-prefetchable)
         Region 1: Memory at fb0a0000 (32-bit, non-prefetchable) [size=128K]
         Region 2: I/O ports at a800 [size=64]
         Capabilities: [dc] Power Management version 2
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [e4] PCI-X non-bridge device.
                 Command: DPERE- ERO+ RBC=0 OST=0
                 Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-
         Capabilities: [f0] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                 Address: 0000000000000000  Data: 0000

02:03.0 Ethernet controller: Intel Corp. 82541GI/PI Gigabit Ethernet
Controller
         Subsystem: Intel Corp. PRO/1000 MT Desktop Adapter
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
         Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
         Latency: 32 (63750ns min), Cache Line Size 08
         Interrupt: pin A routed to IRQ 5
         Region 0: Memory at fb0c0000 (32-bit, non-prefetchable)
         Region 1: Memory at fb0e0000 (32-bit, non-prefetchable) [size=128K]
         Region 2: I/O ports at ac00 [size=64]
         Capabilities: [dc] Power Management version 2
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 PME-Enable- DSel=0 DScale=1 PME-
         Capabilities: [e4] PCI-X non-bridge device.
                 Command: DPERE- ERO+ RBC=0 OST=0
                 Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=0, RSCEM-
         Capabilities: [f0] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                 Address: 0000000000000000  Data: 0000

[7.6.] SCSI information (from /proc/scsi/scsi)
2 SATA Samsung drives.
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
   Vendor: ATA      Model: SAMSUNG SP0812C  Rev: SU10
   Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 01 Lun: 00
   Vendor: ATA      Model: SAMSUNG SP0812C  Rev: SU10
   Type:   Direct-Access                    ANSI SCSI revision: 05

[7.7.] Other information that might be relevant to the problem
        (please look in /proc and include all information that you
        think to be relevant):
The system is Fedora Core 2.
The default kernel has serious problems with traffic shaping, which were
resolved in the 2.6.9 kernel line.
The kernel is vanilla 2.6.9.
After recompiling the kernel, I recompiled iproute using the vanilla
iproute2-2.6.9-041019.tar.gz.
I use PSCHED_CPU.
Without recompiling iproute, it hung every time two iproute commands
were run in parallel.

[X.] Other notes, patches, fixes, workarounds:
None of those.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-05  9:48 PROBLEM: IProute hangs after running traffic shaping scripts Szymon Miotk
@ 2004-11-05 11:54 ` Thomas Graf
  2004-11-05 14:16   ` [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs Thomas Graf
  2004-11-07 22:22 ` PROBLEM: IProute hangs after running traffic shaping scripts Patrick McHardy
  1 sibling, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-05 11:54 UTC (permalink / raw)
  To: Szymon Miotk; +Cc: netdev, kuznet, jmorris

> [2.] Full description of the problem/report:
> I have a server with 3 links to ISPs and 1 link for internal network.
> I shape my clients to certain speeds, depending on the time of the day.
> I have HTB shaping on each interface, about 2250 classes and 2250 qdisc
> on each, so it makes total ~9000 classes (HTB) and ~9000 qdisc (SFQ).
> I run shaping scripts 4 times/day.

Could you send me the rules via private email or explain their
basic architecture?

My first thought was that you might run out of stack space while
dumping the qdisc tree, but judging from your oops report that
doesn't seem to be the case.

> Sometimes it makes a kernel oops, hangs at some 'tc ...' command (it
> differs).
> Then the shaping works so-so (usually it works, but doesn't fully
> utilize the bandwidth) and every iproute command hangs.
> Killing the hanging processes kills them, but still every iproute
> command hangs, including ip and tc.
> Sometimes the server stops forwarding, but usually it does so few hours
> after kernel oops.
> Reboot always helps.

There have been at least 2 slab corruptions in CBQ, and some of them
might have been ported over to HTB. I will look into it.

> Oct 31 15:02:38 cerber kernel:  [<c037325c>] tc_modify_qdisc+0x0/0x6e3
> Oct 31 15:02:38 cerber kernel:  [<c0373353>] tc_modify_qdisc+0xf7/0x6e3
> Oct 31 15:02:38 cerber kernel:  [<c01044e5>] error_code+0x2d/0x38
> Oct 31 15:02:38 cerber kernel:  [<c037325c>] tc_modify_qdisc+0x0/0x6e3

Something is wrong here, and qdisc_lookup has something to do
with it. I will audit the qdisc_list changes.

> my traffic shaping scripts are rather huge and they don't always cause
> kernel oops. I tried to run them together (so classes and qdisc on every
> interface were changed in parallel), but it didn't help.
> I can send you them if you wish.

My first guess is that something corrupts qdisc_list of the
device. I'm not sure what causes this but I will look into it
today.

> The default kernel has serious problems with traffic shaping, that was
> resolved in 2.6.9 kernel line.

Can you be more exact? Those were different issues, right?

> Without recompiling iproute, it hanged every time two iproute commands
> were run in parallel.

Is this an iproute2 problem or a kernel problem? Can you give an example
of how to trigger it?


* [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 11:54 ` Thomas Graf
@ 2004-11-05 14:16   ` Thomas Graf
  2004-11-05 16:12     ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-05 14:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, spam, kuznet, jmorris

> Oct 31 15:02:38 cerber kernel: Unable to handle kernel paging request at
> virtual address 00100100

The conversion to the generic list API was nice; it really helps in a
case like this, where a poison value shows up in the oops. However, it
is not so nice when such changes introduce new bugs.

-   if (dev) {
-       struct Qdisc *q, **qp;
-       for (qp = &qdisc->dev->qdisc_list; (q=*qp) != NULL; qp = &q->next) {
-           if (q == qdisc) {
-               *qp = q->next;
-               break;
-           }
-       }
-   }
+   list_del(&qdisc->list);

Nice cleanup, although it assumes that everyone calling qdisc_destroy provides a
Qdisc which is actually linked, or at least has an initialized list node. This
was not true for the noqueue and noop dummy qdiscs.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c	2004-11-05 01:11:17.000000000 +0100
+++ linux-2.6.10-rc1-bk14/net/sched/sch_generic.c	2004-11-05 15:05:54.000000000 +0100
@@ -280,6 +280,7 @@
 	.dequeue	=	noop_dequeue,
 	.flags		=	TCQ_F_BUILTIN,
 	.ops		=	&noop_qdisc_ops,	
+	.list		=	LIST_HEAD_INIT(noop_qdisc.list),
 };
 
 struct Qdisc_ops noqueue_qdisc_ops = {
@@ -298,6 +299,7 @@
 	.dequeue	=	noop_dequeue,
 	.flags		=	TCQ_F_BUILTIN,
 	.ops		=	&noqueue_qdisc_ops,
+	.list		=	LIST_HEAD_INIT(noqueue_qdisc.list),
 };
 
 


* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 14:16   ` [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs Thomas Graf
@ 2004-11-05 16:12     ` Patrick McHardy
  2004-11-05 16:39       ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-05 16:12 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>-   if (dev) {
>-       struct Qdisc *q, **qp;
>-       for (qp = &qdisc->dev->qdisc_list; (q=*qp) != NULL; qp = &q->next) {
>-           if (q == qdisc) {
>-               *qp = q->next;
>-               break;
>-           }
>-       }
>-   }
>+   list_del(&qdisc->list);
>
>Nice cleanup, although it assumes that everyone calling qdisc_destroy provides a
>Qdisc which is actually linked or at least has an initialized list node. This
>was not true for the noqueue and noop dummy qdiscs.
>
Nice catch. I can't understand how you triggered that oops,
though. The noop- and noqueue-qdiscs used without qdisc_create_*
are not refcounted, so I would expect:

void qdisc_destroy(struct Qdisc *qdisc)
{
        if (!atomic_dec_and_test(&qdisc->refcnt))
                return;

to underflow and return until refcnt finally reaches 0 again.
Can you explain, please?

Regards
Patrick

>Signed-off-by: Thomas Graf <tgraf@suug.ch>
>
>--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c	2004-11-05 01:11:17.000000000 +0100
>+++ linux-2.6.10-rc1-bk14/net/sched/sch_generic.c	2004-11-05 15:05:54.000000000 +0100
>@@ -280,6 +280,7 @@
> 	.dequeue	=	noop_dequeue,
> 	.flags		=	TCQ_F_BUILTIN,
> 	.ops		=	&noop_qdisc_ops,	
>+	.list		=	LIST_HEAD_INIT(noop_qdisc.list),
> };
> 
> struct Qdisc_ops noqueue_qdisc_ops = {
>@@ -298,6 +299,7 @@
> 	.dequeue	=	noop_dequeue,
> 	.flags		=	TCQ_F_BUILTIN,
> 	.ops		=	&noqueue_qdisc_ops,
>+	.list		=	LIST_HEAD_INIT(noqueue_qdisc.list),
> };
> 
> 
>  
>


* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 16:12     ` Patrick McHardy
@ 2004-11-05 16:39       ` Thomas Graf
  2004-11-05 17:26         ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-05 16:39 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

* Patrick McHardy <418BA66A.60804@trash.net> 2004-11-05 17:12
> Nice catch. I can't understand how you triggered that oops,
> though. The noop- and noqueue-qdiscs used without qdisc_create_*
> are not refcounted, so I would expect:
> 
> void qdisc_destroy(struct Qdisc *qdisc)
> {
>        if (!atomic_dec_and_test(&qdisc->refcnt))
>                return;
> 
> to underflow and return until refcnt finally reaches 0 again.
> Can you explain, please ?

Right, this is indeed the case. This doesn't fix the reported oops,
but it will prevent the oops you are referring to, which was triggered
after 2h of stress testing on my machine. I haven't had the time to
check whether incrementing the refcnt of builtin qdiscs causes
any problems, but initializing the list is good in any case.

I found huge locking problems in the qdisc_destroy calling contexts, though.
Almost all paths calling qdisc_destroy from within qdiscs are messed up.
I am resolving those now. I have a theoretical path that could cause the
reported oops: htb_put -> htb_destroy_class -> qdisc_destroy without
bh-locking dev->queue_lock, so the list unlinking could take place during
a walk/lookup and lead to a POISON value in the next pointer. I could not
reproduce this so far, though.


* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 16:39       ` Thomas Graf
@ 2004-11-05 17:26         ` Patrick McHardy
  2004-11-05 17:58           ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-05 17:26 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>* Patrick McHardy <418BA66A.60804@trash.net> 2004-11-05 17:12
>  
>
>>Nice catch. I can't understand how you triggered that oops,
>>though. The noop- and noqueue-qdiscs used without qdisc_create_*
>>are not refcounted, so I would expect:
>>
>>void qdisc_destroy(struct Qdisc *qdisc)
>>{
>>       if (!atomic_dec_and_test(&qdisc->refcnt))
>>               return;
>>
>>to underflow and return until refcnt finally reaches 0 again.
>>Can you explain, please ?
>>    
>>
>
>Right, this is indeed the case. This doesn't fix the reported oops
>but will prevent the oops you are referring to, which was triggered
>after 2h of stress testing on my machine. I haven't had the time to
>check whether incrementing the refcnt of builtin qdiscs causes
>any problems, but initializing the list is good in any case.
>  
>
Either refcnt them or add some kind of flag to qdiscs created
by qdisc_create/qdisc_create_default and check for that flag.
Initializing the lists doesn't fix all problems: directly using
noop/noqueue doesn't increment the device refcnt, so it must not
be dropped in __qdisc_destroy.

>I found huge locking problems in the qdisc_destroy calling contexts though.
>Almost all paths calling qdisc_destroy from within qdiscs are messed up.
>I am resolving those now. I have a theoretical path that could cause the
>reported oops: htb_put -> htb_destroy_class -> qdisc_destroy without
>bh-locking dev->queue_lock, so the list unlinking could
>take place during a walk/lookup and put a POISON value in the
>next pointer. I could not reproduce this so far, though.
>  
>
ops->put seems to be safe even without holding dev->queue_lock.
The class refcnt is only changed from userspace, and always under
the rtnl semaphore. get/put are always balanced, so practically a
class can never get destroyed by put.

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 17:26         ` Patrick McHardy
@ 2004-11-05 17:58           ` Thomas Graf
  2004-11-05 18:18             ` Patrick McHardy
  2004-11-06  0:36             ` David S. Miller
  0 siblings, 2 replies; 34+ messages in thread
From: Thomas Graf @ 2004-11-05 17:58 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

* Patrick McHardy <418BB7D2.6060908@trash.net> 2004-11-05 18:26
> ops->put seems to be safe even without holding dev->queue_lock.
> The class refcnt is only changed from userspace, and always under
> the rtnl semaphore. get/put are always balanced, so practically a
> class can never get destroyed by put.

You are right, this cannot be the problem. However, there is a
potential risk in qdisc_destroy if dev->queue_lock is not held.
I'm not sure, but aren't all callers of qdisc_destroy that hold
qdisc_lock_tree(dev), such as dev_shutdown, a potential deadlock
risk because __qdisc_destroy tries to take the lock again?

> Either refcnt them or add some kind of flag to qdiscs created
> by qdisc_create/qdisc_create_default and check for that flag.
> Initializing the lists doesn't fix all problems: directly using
> noop/noqueue doesn't increment the device refcnt, so it must not
> be dropped in __qdisc_destroy.

I was irritated by the TCQ_F_BUILTIN check in __qdisc_destroy. None
of the code in __qdisc_destroy should be applied to a builtin qdisc
or am I missing something?

The patch below prevents builtin qdiscs from being destroyed and
fixes a refcnt underflow that would lead to bogus list unlinking
and a dev_put.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c	2004-11-05 18:44:49.000000000 +0100
+++ linux-2.6.10-rc1-bk14/net/sched/sch_generic.c	2004-11-05 18:43:52.000000000 +0100
@@ -479,15 +479,15 @@
 	module_put(ops->owner);
 
 	dev_put(qdisc->dev);
-	if (!(qdisc->flags&TCQ_F_BUILTIN))
-		kfree((char *) qdisc - qdisc->padded);
+	kfree((char *) qdisc - qdisc->padded);
 }
 
 /* Under dev->queue_lock and BH! */
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
-	if (!atomic_dec_and_test(&qdisc->refcnt))
+	if (qdisc->flags & TCQ_F_BUILTIN ||
+		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
 	list_del(&qdisc->list);
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 17:58           ` Thomas Graf
@ 2004-11-05 18:18             ` Patrick McHardy
  2004-11-05 19:43               ` Thomas Graf
  2004-11-06  0:36             ` David S. Miller
  1 sibling, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-05 18:18 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>* Patrick McHardy <418BB7D2.6060908@trash.net> 2004-11-05 18:26
>  
>
>>ops->put seems to be safe even without holding dev->queue_lock.
>>The class refcnt is only changed from userspace, and always under
>>the rtnl semaphore. get/put are always balanced, so practically a
>>class can never get destroyed by put.
>>    
>>
>
>You are right, this cannot be the problem. However, there is a
>potential risk in qdisc_destroy if dev->queue_lock is not held.
>  
>
Yes, but there doesn't seem to be a path where this is true.

>I'm not sure, but aren't all callers of qdisc_destroy that hold
>qdisc_lock_tree(dev), such as dev_shutdown, a potential deadlock
>risk because __qdisc_destroy tries to take the lock again?
>  
>
__qdisc_destroy is called from an rcu callback, not directly from
qdisc_destroy.

>>Either refcnt them or add some kind of flag to qdiscs created
>>by qdisc_create/qdisc_create_default and check for that flag.
>>Initializing the lists doesn't fix all problems: directly using
>>noop/noqueue doesn't increment the device refcnt, so it must not
>>be dropped in __qdisc_destroy.
>>    
>>
>
>I was irritated by the TCQ_F_BUILTIN check in __qdisc_destroy. None
>of the code in __qdisc_destroy should be applied to a builtin qdisc
>or am I missing something?
>  
>
No, your patch looks fine.

Regards
Patrick

>The patch below prevents builtin qdiscs from being destroyed and
>fixes a refcnt underflow that would lead to bogus list unlinking
>and a dev_put.
>
>Signed-off-by: Thomas Graf <tgraf@suug.ch>
>
>--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c	2004-11-05 18:44:49.000000000 +0100
>+++ linux-2.6.10-rc1-bk14/net/sched/sch_generic.c	2004-11-05 18:43:52.000000000 +0100
>@@ -479,15 +479,15 @@
> 	module_put(ops->owner);
> 
> 	dev_put(qdisc->dev);
>-	if (!(qdisc->flags&TCQ_F_BUILTIN))
>-		kfree((char *) qdisc - qdisc->padded);
>+	kfree((char *) qdisc - qdisc->padded);
> }
> 
> /* Under dev->queue_lock and BH! */
> 
> void qdisc_destroy(struct Qdisc *qdisc)
> {
>-	if (!atomic_dec_and_test(&qdisc->refcnt))
>+	if (qdisc->flags & TCQ_F_BUILTIN ||
>+		!atomic_dec_and_test(&qdisc->refcnt))
> 		return;
> 	list_del(&qdisc->list);
> 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
>
>
>  
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 18:18             ` Patrick McHardy
@ 2004-11-05 19:43               ` Thomas Graf
  2004-11-06  1:18                 ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-05 19:43 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

* Patrick McHardy <418BC40E.8080402@trash.net> 2004-11-05 19:18
> Yes, but there doesn't seem to be a path where this is true.

I will double check this; there must be something wrong with
one of the callers of qdisc_destroy, since that is the only way
entries can be removed from qdisc_list, and otherwise the oops
wouldn't show the POISON1 value.

> __qdisc_destroy is called from a rcu-callback, not directly from
> qdisc_destroy.

You're indeed right, I hadn't noticed that queue_lock is a softirq
spinlock, stupid me.

So to make qdisc_destroy reliable we should either make sure
dev->queue_lock is held for every call or switch to list_del_rcu,
since qdisc_destroy might be called from bh context.

I will audit all paths, I'm pretty sure the reported oops is
somehow related to this.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 17:58           ` Thomas Graf
  2004-11-05 18:18             ` Patrick McHardy
@ 2004-11-06  0:36             ` David S. Miller
  1 sibling, 0 replies; 34+ messages in thread
From: David S. Miller @ 2004-11-06  0:36 UTC (permalink / raw)
  To: Thomas Graf; +Cc: kaber, netdev, spam, kuznet, jmorris

On Fri, 5 Nov 2004 18:58:12 +0100
Thomas Graf <tgraf@suug.ch> wrote:

> I was irritated by the TCQ_F_BUILTIN check in __qdisc_destroy. None
> of the code in __qdisc_destroy should be applied to a builtin qdisc
> or am I missing something?
> 
> The patch below prevents builtin qdiscs from being destroyed and
> fixes a refcnt underflow whould lead to a bogus list unlinking
> and dev_put.
> 
> Signed-off-by: Thomas Graf <tgraf@suug.ch>

I'll apply this, along with the builtin qdisc list initializer
change too, just for cleanliness. Thanks Thomas.

Ingo Molnar noticed this refcount underflow condition in qdisc_destroy()
a couple of weeks ago.  At the time I didn't connect it to being a builtin
qdisc issue.  Now we know the true cause.

Thanks again.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-05 19:43               ` Thomas Graf
@ 2004-11-06  1:18                 ` Thomas Graf
  2004-11-06  1:47                   ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-06  1:18 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

* Thomas Graf <20041105194303.GA12289@postel.suug.ch> 2004-11-05 20:43
> * Patrick McHardy <418BC40E.8080402@trash.net> 2004-11-05 19:18
> > Yes, but there doesn't seem to be a path where this is true.
> 
> I will double check this; there must be something wrong with
> one of the callers of qdisc_destroy, since that is the only way
> entries can be removed from qdisc_list, and otherwise the oops
> wouldn't show the POISON1 value.

I think I've found the problem and it sounds too banal to be true.
qdisc_destroy uses list_del where it used to use manual list
management that would not interfere with list walkers. The list
walkers are not protected on UP systems, which is why it was
impossible for me to trigger it with SMP enabled. So a qdisc_list
walker such as qdisc_lookup could have its next pointer overwritten
with LIST_POISON1 while walking if qdisc_destroy is called in between?
It sounds right, but on the other hand all the callers except
dev_shutdown should be serialized with the list walkers by the
rtnl semaphore.

I guess I'm missing that little bit of experience, but it would
help to use the _rcu list variants, wouldn't it?
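The poisoning hazard being discussed can be shown with a small userspace model of the kernel's list primitives. This is an illustrative sketch only (the real definitions live in the kernel's list.h; the poison constants here mirror the kernel's but the structures are re-declared for standalone compilation):

```c
/* Userspace model of the hazard: list_del() poisons the removed
 * entry's next pointer, so a lockless walker that is currently
 * standing on that entry would dereference LIST_POISON1.
 * list_del_rcu() leaves ->next intact so concurrent readers can
 * still step forward off the removed entry. */

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_POISON1 ((struct list_head *)0x00100100)
#define LIST_POISON2 ((struct list_head *)0x00200200)

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = LIST_POISON1;	/* walker at `entry` now sees poison */
	entry->prev = LIST_POISON2;
}

static void list_del_rcu(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->prev = LIST_POISON2;	/* ->next stays valid for readers */
}
```

A walker that already holds a pointer to the deleted entry follows `->next` into the poison value with plain `list_del`, which is exactly the LIST_POISON1 oops signature; with `list_del_rcu` it still reaches the successor.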

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-06  1:18                 ` Thomas Graf
@ 2004-11-06  1:47                   ` Patrick McHardy
  2004-11-06  1:59                     ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-06  1:47 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>I think I've found the problem and it sounds too banal to be true.
>qdisc_destroy uses list_del where it used to use manual list
>management that would not interfere with list walkers.
>
Nice work, I was really beginning to wonder. There's not that
much that touches qdisc_list :)

> The list
>walkers are not protected on UP systems, which is why it was
>impossible for me to trigger it with SMP enabled. So a qdisc_list
>walker such as qdisc_lookup could have its next pointer overwritten
>with LIST_POISON1 while walking if qdisc_destroy is called in between?
>It sounds right, but on the other hand all the callers except
>dev_shutdown should be serialized with the list walkers by the
>rtnl semaphore.
>
The __qdisc_destroy rcu-callback is called in softirq context; when
destroying a classful qdisc, the qdisc's destroy function might call
qdisc_destroy again for an inner class.

>I guess I'm missing that little bit of experience, but it would
>help to use the _rcu list variants, wouldn't it?
>
Yes. Are you going to send a patch ?

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-06  1:47                   ` Patrick McHardy
@ 2004-11-06  1:59                     ` Thomas Graf
  2004-11-06 14:50                       ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-06  1:59 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

* Patrick McHardy <418C2D40.9020300@trash.net> 2004-11-06 02:47
> Nice work, I was really beginning to wonder. There's not that
> much that touches qdisc_list :)

Which might have been the problem; I've been focusing on this for
hours, overlooking the obvious. The bug report was very well done,
maybe too good, so I got to the root of the problem too fast. ;->

> The __qdisc_destroy rcu-callback is called in softirq context; when
> destroying a classful qdisc, the qdisc's destroy function might call
> qdisc_destroy again for an inner class.

Ahh... of course. I was looking through all callers of destroy multiple
times but somehow I managed to overlook the one in __qdisc_destroy
over and over.  Thanks!

> >I guess I'm missing that little bit of experience but it would
> >help to use _rcu list variantes, would it?
> >
> Yes. Are you going to send a patch ?

Sure, tomorrow, if nobody preempts me. I need some rest.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-06  1:59                     ` Thomas Graf
@ 2004-11-06 14:50                       ` Thomas Graf
  2004-11-07  8:57                         ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-06 14:50 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

The patch below transforms qdisc_list into an RCU-protected list.
However, this does not really fix the problem. The example below
shows what can happen when a classful qdisc is deleted and its
destroy handler calls qdisc_destroy.

--->        RTM_DELQDISC message
            ...
            qdisc_destroy() sets up rcu callback __qdisc_destroy
<---        ...
--->        RCU callback __qdisc_destroy
            qdisc_destroy() sets up rcu callback __qdisc_destroy
<---        ...
--->        continuing RTM_DELQDISC message
            ...
<---        message finished
--->        RTM_GETQDISC
            ...
<---        p = qdisc_lookup() (qdisc_lookup itself is not interrupted)
--->        RCU callback __qdisc_destroy
<---        May free p 
--->        continuing RTM_GETQDISC message
            use of p references freed memory --> oops

So using the RCU-protected list only ensures that the unlinking is
safe while within a list walk and that the rcu callback is deferred
until the list walk is complete, but the callback may be executed
right after qdisc_lookup and then free the entry that was returned.

So we must either rcu_read_lock all code blocks that use the
result of a qdisc_lookup, which is not trivial, or ensure
that all rcu callbacks have finished before the next rtnetlink
message is processed.

Maybe the easiest would be to have a refcnt which is
incremented when a rcu callback is set up and decremented
when it finishes, and to sleep on it after processing an
rtnetlink message until it reaches 0 again.

Thoughts?
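The pending-callback refcnt idea could look roughly like this userspace sketch. Everything here is hypothetical (the names `rcu_pending`, `fake_call_rcu_scheduled`, etc. are invented for illustration), and the busy-wait stands in for a real sleep on a waitqueue:

```c
#include <stdatomic.h>

/* Sketch of the proposal above: count RCU callbacks in flight; after
 * processing an rtnetlink message, the caller waits until the count
 * drops to zero, guaranteeing all deferred qdisc destruction has
 * completed before the next message can look anything up. */

static atomic_int rcu_pending;

static void fake_call_rcu_scheduled(void)
{
	atomic_fetch_add(&rcu_pending, 1);	/* callback set up */
}

static void fake_rcu_callback_done(void)
{
	atomic_fetch_sub(&rcu_pending, 1);	/* callback finished */
}

/* In the real proposal this would sleep (e.g. wait_event()); a
 * busy-wait stands in for that in this single-threaded model. */
static void wait_for_rcu_callbacks(void)
{
	while (atomic_load(&rcu_pending) > 0)
		;	/* schedule() / wait on a waitqueue in the kernel */
}
```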

Anyway, here's the patch. It's relative to my earlier patch
and Patrick's rcu_assign_ptr patch.

Transform qdisc_list into an RCU-protected list to prevent
qdisc_destroy, which might be called from softirq context, from
overwriting the next pointer while a user-context list walk is
taking place.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

diff -Nru linux-2.6.10-rc1-bk14.orig/net/sched/sch_api.c linux-2.6.10-rc1-bk14/net/sched/sch_api.c
--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_api.c	2004-11-06 14:57:44.000000000 +0100
+++ linux-2.6.10-rc1-bk14/net/sched/sch_api.c	2004-11-06 14:52:22.000000000 +0100
@@ -196,10 +196,14 @@
 {
 	struct Qdisc *q;
 
-	list_for_each_entry(q, &dev->qdisc_list, list) {
-		if (q->handle == handle)
+	rcu_read_lock();
+	list_for_each_entry_rcu(q, &dev->qdisc_list, list) {
+		if (q->handle == handle) {
+			rcu_read_unlock();
 			return q;
+		}
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -453,7 +457,7 @@
 
 	if (!ops->init || (err = ops->init(sch, tca[TCA_OPTIONS-1])) == 0) {
 		qdisc_lock_tree(dev);
-		list_add_tail(&sch->list, &dev->qdisc_list);
+		list_add_tail_rcu(&sch->list, &dev->qdisc_list);
 		qdisc_unlock_tree(dev);
 
 #ifdef CONFIG_NET_ESTIMATOR
@@ -817,7 +821,7 @@
 			s_q_idx = 0;
 		read_lock_bh(&qdisc_tree_lock);
 		q_idx = 0;
-		list_for_each_entry(q, &dev->qdisc_list, list) {
+		list_for_each_entry_rcu(q, &dev->qdisc_list, list) {
 			if (q_idx < s_q_idx) {
 				q_idx++;
 				continue;
@@ -1054,7 +1058,7 @@
 	t = 0;
 
 	read_lock_bh(&qdisc_tree_lock);
-	list_for_each_entry(q, &dev->qdisc_list, list) {
+	list_for_each_entry_rcu(q, &dev->qdisc_list, list) {
 		if (t < s_t || !q->ops->cl_ops ||
 		    (tcm->tcm_parent &&
 		     TC_H_MAJ(tcm->tcm_parent) != q->handle)) {
diff -Nru linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c linux-2.6.10-rc1-bk14/net/sched/sch_generic.c
--- linux-2.6.10-rc1-bk14.orig/net/sched/sch_generic.c	2004-11-06 14:57:44.000000000 +0100
+++ linux-2.6.10-rc1-bk14/net/sched/sch_generic.c	2004-11-06 15:16:16.000000000 +0100
@@ -486,7 +486,7 @@
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
-	list_del(&qdisc->list);
+	list_del_rcu(&qdisc->list);
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 
@@ -507,7 +507,7 @@
 				return;
 			}
 			write_lock_bh(&qdisc_tree_lock);
-			list_add_tail(&qdisc->list, &dev->qdisc_list);
+			list_add_tail_rcu(&qdisc->list, &dev->qdisc_list);
 			write_unlock_bh(&qdisc_tree_lock);
 		} else {
 			qdisc =  &noqueue_qdisc;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-06 14:50                       ` Thomas Graf
@ 2004-11-07  8:57                         ` Patrick McHardy
  2004-11-07 14:00                           ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-07  8:57 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>So using the RCU-protected list only ensures that the unlinking is
>safe while within a list walk and that the rcu callback is deferred
>until the list walk is complete, but the callback may be executed
>right after qdisc_lookup and then free the entry that was returned.
>
>So we must either rcu_read_lock all code blocks that use the
>result of a qdisc_lookup, which is not trivial, or ensure
>that all rcu callbacks have finished before the next rtnetlink
>message is processed.
>
>Maybe the easiest would be to have a refcnt which is
>incremented when a rcu callback is set up and decremented
>when it finishes, and to sleep on it after processing an
>rtnetlink message until it reaches 0 again.
>
>Thoughts?
>  
>
Before the RCU change, destruction of the qdisc and all inner
qdiscs happened immediately and under the rtnl semaphore. This
made sure nothing holding the rtnl semaphore could end up with
invalid memory. This is not true anymore: qdiscs found on
dev->qdisc_list can be destroyed suddenly.

dev->qdisc_list is protected by qdisc_tree_lock everywhere but in
qdisc_lookup; it is also the only structure that is consistently
protected by this lock. To fix the list corruption we can either
protect qdisc_lookup with qdisc_tree_lock, or use the rcu-list macros
and remove all read_lock(&qdisc_tree_lock) calls (replacing the
rwlock with a spinlock).

Unfortunately, since we cannot rely on the rtnl protection for
memory anymore, it seems we need to refcount all uses of
dev->qdisc_list that previously relied on this protection and can't
use rcu_read_lock. To make this safe, we need to do the
atomic_dec_and_test/list_del in qdisc_destroy atomically and the
list_for_each_entry/atomic_inc in qdisc_lookup atomically, so we
should simply keep the non-rcu lists and use qdisc_tree_lock
in qdisc_lookup.

I'm working on a patch, but it will take a few days.

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07  8:57                         ` Patrick McHardy
@ 2004-11-07 14:00                           ` Thomas Graf
  2004-11-07 16:19                             ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-07 14:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

> Before the RCU change, destruction of the qdisc and all inner
> qdiscs happened immediately and under the rtnl semaphore. This
> made sure nothing holding the rtnl semaphore could end up with
> invalid memory. This is not true anymore: qdiscs found on
> dev->qdisc_list can be destroyed suddenly.

And we should switch back to this again if possible. I haven't
audited all paths to dev_activate, but we have at most
a list addition which might not be protected by the
rtnl semaphore. I'm not 100% sure about this yet.

> dev->qdisc_list is protected by qdisc_tree_lock everywhere but in
> qdisc_lookup; it is also the only structure that is consistently
> protected by this lock. To fix the list corruption we can either
> protect qdisc_lookup with qdisc_tree_lock, or use the rcu-list macros
> and remove all read_lock(&qdisc_tree_lock) calls (replacing the
> rwlock with a spinlock).

qdisc_lookup was the only one not yet protected by a preempt
disable.

> Unfortunately, since we cannot rely on the rtnl protection for
> memory anymore, it seems we need to refcount all uses of
> dev->qdisc_list that previously relied on this protection and can't
> use rcu_read_lock.

There is no list iteration not yet protected by the rtnl semaphore
and the only interruption is because of the rcu callback.

> To make this safe, we need to do the
> atomic_dec_and_test/list_del in qdisc_destroy atomically and the
> list_for_each_entry/atomic_inc in qdisc_lookup atomically, so we
> should simply keep the non-rcu lists and use qdisc_tree_lock
> in qdisc_lookup.

You mean before qdisc_lookup and until the reference is released
again? These are huge locking regions involving calls that might
sleep and possible qdisc_destroy calling paths, so this won't
work very well.

So in my opinion we should screw that call_rcu because it doesn't
make much sense. In case dev_activate is not synchronized with the
rtnl semaphore, we have to make sure that qdisc_destroy always
takes qdisc_tree_lock, which is not the case for a few paths as
of now, although I'm not sure if any of those actually ever call
qdisc_destroy with refcnt==1.

If screwing call_rcu is not possible, we can still use a refcnt
incremented before call_rcu in qdisc_destroy; every base
caller of qdisc_destroy (excluding those in qdisc destroy routines)
then sleeps on it after invoking qdisc_destroy, once it has reached
a safe place to sleep. That way we can make sure the qdisc is really
gone after invoking qdisc_destroy. Otherwise we will keep running
into trouble with new messages arriving while qdisc deletions are
still pending.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 14:00                           ` Thomas Graf
@ 2004-11-07 16:19                             ` Patrick McHardy
  2004-11-07 16:33                               ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-07 16:19 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>You mean before qdisc_lookup and until the reference is released
>again? These are huge locking regions involving calls that might
>sleep and possible qdisc_destroy calling paths, so this won't
>work very well.
>
Who cares about huge sections while holding reference counts?
qdisc_destroy won't destroy the qdisc until all references have
been dropped; that's the whole point of it.

>So in my opinion we should screw that call_rcu because it doesn't
>make much sense. In case dev_activate is not synchronized with the
>rtnl semaphore, we have to make sure that qdisc_destroy always
>takes qdisc_tree_lock, which is not the case for a few paths as
>of now, although I'm not sure if any of those actually ever call
>qdisc_destroy with refcnt==1.
>  
>
I'm also fixing up the remaining locking bugs, so before we start
reverting things lets just wait and see how it gets.

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 16:19                             ` Patrick McHardy
@ 2004-11-07 16:33                               ` Thomas Graf
  2004-11-07 17:02                                 ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-07 16:33 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

> Who cares about huge sections while holding reference counts?
> qdisc_destroy won't destroy the qdisc until all references have
> been dropped; that's the whole point of it.

I might have misunderstood you on this point: you increment a
refcnt in qdisc_lookup and decrement it once you're done with
the reference? I thought you wanted to use bh spinlocks. I'm not
sure how you want to do this without creating races.

Example:

We get an RTM_DELQDISC request, so you'll increment the refcnt in
qdisc_lookup and decrement it right before you call qdisc_destroy
so it actually can be deleted. The rcu callback works fine
and will set up another rcu callback for destroying
the inner qdiscs. Right at this time you get an RTM_GETQDISC for
that inner qdisc, so you'll lock on it; then the rcu callback
comes in and cannot delete the inner qdisc anymore. Do you want
to sleep in softirq context?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 16:33                               ` Thomas Graf
@ 2004-11-07 17:02                                 ` Patrick McHardy
  2004-11-07 17:49                                   ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-07 17:02 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>I might have misunderstood you on this point: you increment a
>refcnt in qdisc_lookup and decrement it once you're done with
>the reference? I thought you wanted to use bh spinlocks. I'm not
>sure how you want to do this without creating races.
>
>Example:
>
>We get an RTM_DELQDISC request, so you'll increment the refcnt in
>qdisc_lookup and decrement it right before you call qdisc_destroy
>so it actually can be deleted. The rcu callback works fine
>and will set up another rcu callback for destroying
>the inner qdiscs. Right at this time you get an RTM_GETQDISC for
>that inner qdisc, so you'll lock on it; then the rcu callback
>comes in and cannot delete the inner qdisc anymore. Do you want
>to sleep in softirq context?
>
This can't happen. Once the inner qdiscs refcnt drops to zero
they are removed from the list, before the rcu callback is scheduled.
Once the RCU callback is scheduled it can't be found anymore.

BTW: An alternative, quite unintrusive solution is to prevent anyone
from finding the inner qdiscs after the outer one has been destroyed.
This can be done by keeping inner qdiscs on qdisc->qdisc_list and
keeping only the top-level qdisc in struct net_device. Of course, this
makes walking all qdiscs more complicated. A generation counter for
the top-level qdisc should also work.

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 17:02                                 ` Patrick McHardy
@ 2004-11-07 17:49                                   ` Thomas Graf
  2004-11-07 18:22                                     ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-07 17:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

> >We get an RTM_DELQDISC request, so you'll increment the refcnt in
> >qdisc_lookup and decrement it right before you call qdisc_destroy
> >so it actually can be deleted. The rcu callback works fine
> >and will set up another rcu callback for destroying
> >the inner qdiscs. Right at this time you get an RTM_GETQDISC for
> >that inner qdisc, so you'll lock on it; then the rcu callback
> >comes in and cannot delete the inner qdisc anymore. Do you want
> >to sleep in softirq context?
> >
> This can't happen. Once the inner qdiscs refcnt drops to zero
> they are removed from the list, before the rcu callback is scheduled.
> Once the RCU callback is scheduled it can't be found anymore.

Right, but what about when the RTM_GETQDISC comes in before the first
rcu callback is invoked? I'm not sure whether this can happen in
practice, though.

Anyway, I do think we should force the task to be completed, or
at least all the list unlinking, before the rtnl semaphore is given
back. I'm fine with postponing the deletion of the object, but not
with postponing list manipulations, even if we cannot reproduce the
problem now.

We might be able to get a working solution for now, but
this unneeded async behaviour will definitely cause a lot of trouble
in the future.

I think no function that can be invoked from softirq context should
ever have the chance to sleep, even if there is no such path at the
moment. Why bother with this when we can move the possible sleep into
a user-context-only area?

> BTW: An alternative, quite unintrusive solution is to prevent anyone
> from finding the inner qdiscs after the outer one has been destroyed.
> This can be done by keeping inner qdiscs on qdisc->qdisc_list and
> keeping only the top-level qdisc in struct net_device. Of course, this
> makes walking all qdiscs more complicated. A generation counter for
> the top-level qdisc should also work.

I agree, that's rather non-trivial.

I'll wait for your patch, but it's wrong in my eyes ;->

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 17:49                                   ` Thomas Graf
@ 2004-11-07 18:22                                     ` Patrick McHardy
  2004-11-07 19:08                                       ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-07 18:22 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, netdev, spam, kuznet, jmorris

Thomas Graf wrote:

>Anyway, I do think we should force the task to be completed, or
>at least all the list unlinking, before the rtnl semaphore is given
>back. I'm fine with postponing the deletion of the object, but not
>with postponing list manipulations, even if we cannot reproduce the
>problem now.
>  
>

This is what I'm doing now, your patch to set qdisc->parent makes
this very easy :) Simply remove all qdiscs with
(TC_H_MAJ(q->parent) == TC_H_MAJ(qdisc->handle)) from
dev->qdisc_list when a classful qdisc is destroyed.
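The removal rule Patrick describes can be modeled standalone. This is a hedged sketch, not the kernel code: tc handles are 32-bit "major:minor" values, an inner qdisc's parent classid carries the major of the classful qdisc it lives in, and the `fake_qdisc`/`unlink_inner_qdiscs` names are invented for illustration (a `linked` flag stands in for membership of dev->qdisc_list):

```c
#include <stdint.h>

/* tc handle layout: upper 16 bits are the major number. */
#define TC_H_MAJ_MASK 0xFFFF0000U
#define TC_H_MAJ(h)   ((h) & TC_H_MAJ_MASK)

struct fake_qdisc {
	uint32_t handle;	/* this qdisc's own major:minor */
	uint32_t parent;	/* classid it is attached to */
	int linked;		/* stands in for being on qdisc_list */
};

/* When a classful qdisc dies, unlink every qdisc whose parent major
 * matches the dying qdisc's handle major, so nothing can look up an
 * inner qdisc that is about to be destroyed. */
static void unlink_inner_qdiscs(struct fake_qdisc *list, int n,
				const struct fake_qdisc *dying)
{
	for (int i = 0; i < n; i++)
		if (list[i].linked &&
		    TC_H_MAJ(list[i].parent) == TC_H_MAJ(dying->handle))
			list[i].linked = 0;
}
```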

Regards
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs
  2004-11-07 18:22                                     ` Patrick McHardy
@ 2004-11-07 19:08                                       ` Thomas Graf
  0 siblings, 0 replies; 34+ messages in thread
From: Thomas Graf @ 2004-11-07 19:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: davem, netdev, spam, kuznet, jmorris

> >Anyways, I do think we should force the task to be completed, or
> >at least all the list unlinking, before the rtnl semaphore is given
> >back.  I'm fine with postponing the deletion of the object but not
> >to postpone list manipulations even if we cannot reproduce it now.
> 
> This is what I'm doing now, your patch to set qdisc->parent makes
> this very easy :) Simply remove all qdiscs with
> (TC_H_MAJ(q->parent) == TC_H_MAJ(qdisc->handle)) from
> dev->qdisc_list when a classful qdisc is destroyed.

Great, that's even better; it also speeds up deletion of big class
trees. Thanks.


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-05  9:48 PROBLEM: IProute hangs after running traffic shaping scripts Szymon Miotk
  2004-11-05 11:54 ` Thomas Graf
@ 2004-11-07 22:22 ` Patrick McHardy
  2004-11-08  1:40   ` Patrick McHardy
  1 sibling, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-07 22:22 UTC (permalink / raw)
  To: Szymon Miotk; +Cc: netdev, Thomas Graf

[-- Attachment #1: Type: text/plain, Size: 724 bytes --]

Szymon Miotk wrote:

> This mail was posted 02-11-2004 to linux-net@vger.kernel.org, but I 
> got no response at all, so I am resending it.
>
> [1.] One line summary of the problem:
> IProute hangs after running traffic shaping scripts

Can you test if this patch helps, please? It immediately removes all
inner qdiscs of classful qdiscs from dev->qdisc_list, making it
impossible for anyone to look up an inner qdisc that is about to get
destroyed. Taking qdisc_tree_lock in qdisc_lookup is not really
necessary with this patch anymore, since all changes to dev->qdisc_list
happen under the rtnl again (which is also relied upon for memory
safety). I put it in anyway for now until I have fully analyzed the
locking.

Regards
Patrick


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 1532 bytes --]

===== net/sched/sch_api.c 1.42 vs edited =====
--- 1.42/net/sched/sch_api.c	2004-11-07 05:03:04 +01:00
+++ edited/net/sched/sch_api.c	2004-11-07 22:59:52 +01:00
@@ -194,13 +194,15 @@
 
 struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
 {
-	struct Qdisc *q;
+	struct Qdisc *q = NULL;
 
+	read_lock_bh(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
 		if (q->handle == handle)
-			return q;
+			break;
 	}
-	return NULL;
+	read_unlock_bh(&qdisc_tree_lock);
+	return q;
 }
 
 struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
===== net/sched/sch_generic.c 1.30 vs edited =====
--- 1.30/net/sched/sch_generic.c	2004-11-06 01:34:45 +01:00
+++ edited/net/sched/sch_generic.c	2004-11-07 22:37:55 +01:00
@@ -483,10 +483,30 @@
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	struct list_head cql = LIST_HEAD_INIT(cql);
+	struct Qdisc *cq, *q, *n;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
+
 	list_del(&qdisc->list);
+
+	/* unlink inner qdiscs from dev->qdisc_list immediately */
+	if (qdisc->ops->cl_ops != NULL)
+		list_add(&qdisc->list, &cql);
+	list_for_each_entry(cq, &cql, list) {
+		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
+		if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
+			if (q->ops->cl_ops != NULL)
+				list_move_tail(&q->list, &cql);
+			else
+				list_del_init(&q->list);
+		}
+	}
+	list_for_each_entry_safe(cq, n, &cql, list)
+		list_del_init(&cq->list);
+
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-07 22:22 ` PROBLEM: IProute hangs after running traffic shaping scripts Patrick McHardy
@ 2004-11-08  1:40   ` Patrick McHardy
  2004-11-08 13:54     ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-08  1:40 UTC (permalink / raw)
  To: Szymon Miotk; +Cc: netdev, Thomas Graf

[-- Attachment #1: Type: text/plain, Size: 829 bytes --]

Patrick McHardy wrote:

> Szymon Miotk wrote:
>
>> This mail was posted 02-11-2004 to linux-net@vger.kernel.org, but I 
>> got no response at all, so I am resending it.
>>
>> [1.] One line summary of the problem:
>> IProute hangs after running traffic shaping scripts
>
>
> Can you test if this patch helps please ? It immediately removes all
> inner qdiscs of classful qdiscs from dev->qdisc_list, making it
> impossible for anyone to look up an inner qdisc that is about to get
> destroyed. Taking qdisc_tree_lock in qdisc_lookup is not really
> necessary with this patch anymore, since all changes to dev->qdisc_list
> happen under the rtnl again (which is also relied on for memory). I put
> it in anyways for now until I have fully analyzed locking.

Please test this patch; the last one had a bug.

>
> Regards
> Patrick
>

[-- Attachment #2: x --]
[-- Type: text/plain, Size: 1434 bytes --]

===== net/sched/sch_api.c 1.42 vs edited =====
--- 1.42/net/sched/sch_api.c	2004-11-07 05:03:04 +01:00
+++ edited/net/sched/sch_api.c	2004-11-08 02:38:04 +01:00
@@ -196,10 +196,14 @@
 {
 	struct Qdisc *q;
 
+	read_lock_bh(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
-		if (q->handle == handle)
+		if (q->handle == handle) {
+			read_unlock_bh(&qdisc_tree_lock);
 			return q;
+		}
 	}
+	read_unlock_bh(&qdisc_tree_lock);
 	return NULL;
 }
 
===== net/sched/sch_generic.c 1.30 vs edited =====
--- 1.30/net/sched/sch_generic.c	2004-11-06 01:34:45 +01:00
+++ edited/net/sched/sch_generic.c	2004-11-08 00:08:28 +01:00
@@ -483,10 +483,29 @@
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	struct list_head cql = LIST_HEAD_INIT(cql);
+	struct Qdisc *cq, *q, *n;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
+
 	list_del(&qdisc->list);
+
+	/* unlink inner qdiscs from dev->qdisc_list immediately */
+	if (qdisc->ops->cl_ops != NULL)
+		list_add(&qdisc->list, &cql);
+	list_for_each_entry(cq, &cql, list)
+		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
+			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
+				if (q->ops->cl_ops != NULL)
+					list_move_tail(&q->list, &cql);
+				else
+					list_del_init(&q->list);
+			}
+	list_for_each_entry_safe(cq, n, &cql, list)
+		list_del_init(&cq->list);
+
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08  1:40   ` Patrick McHardy
@ 2004-11-08 13:54     ` Thomas Graf
  2004-11-08 16:12       ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-08 13:54 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Szymon Miotk, netdev

> +	/* unlink inner qdiscs from dev->qdisc_list immediately */
> +	if (qdisc->ops->cl_ops != NULL)
> +		list_add(&qdisc->list, &cql);

I think you should extend the above to:

if (qdisc->ops->cl_ops != NULL && !list_empty(&qdisc->list))

Otherwise you might unlink entries that were just added before the
RCU callback runs, and it also prevents an unneeded attempt to delete
qdiscs for inner classful qdiscs.

Other than that, the patch looks very good.


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 13:54     ` Thomas Graf
@ 2004-11-08 16:12       ` Patrick McHardy
  2004-11-08 18:33         ` Thomas Graf
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-08 16:12 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Szymon Miotk, netdev

Thomas Graf wrote:

>>+	/* unlink inner qdiscs from dev->qdisc_list immediately */
>>+	if (qdisc->ops->cl_ops != NULL)
>>+		list_add(&qdisc->list, &cql);
>>    
>>
>
>I think you should extend the above to:
>
>if (qdisc->ops->cl_ops != NULL && !list_empty(&qdisc->list))
>
>Otherwise you might unlink entries just added before the
>rcu callback and it prevents an unneeded attempt to delete
>qdiscs for inner classful qdiscs
>  
>
There is some optimization possible; I will do this for the final
patch. But I don't understand the problem you refer to, can you
please explain?

Regards
Patrick


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 16:12       ` Patrick McHardy
@ 2004-11-08 18:33         ` Thomas Graf
  2004-11-08 19:46           ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Graf @ 2004-11-08 18:33 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Szymon Miotk, netdev

* Patrick McHardy <418F9AD0.1040701@trash.net> 2004-11-08 17:12
> Thomas Graf wrote:
> 
> >>+	/* unlink inner qdiscs from dev->qdisc_list immediately */
> >>+	if (qdisc->ops->cl_ops != NULL)
> >>+		list_add(&qdisc->list, &cql);
> >>   
> >>
> >
> >I think you should extend the above to:
> >
> >if (qdisc->ops->cl_ops != NULL && !list_empty(&qdisc->list))
> >
> >Otherwise you might unlink entries just added before the
> >rcu callback and it prevents an unneeded attempt to delete
> >qdiscs for inner classful qdiscs
> > 
> >
> There is some optimization possible, I will do this for the final
> patch. But I don't understand the problem you refer to, can you
> please explain ?

I don't have the time to verify this at the moment but:

1) qdisc_destroy unlinking all the lists
2) RTM_NEWQDISC creating a new qdisc with the same major classid as the old one
   which will succeed since the old one cannot be found anymore.
3) rcu callback __qdisc_destroy -> qdisc_destroy looking through qdisc_list
   again and then deleting the new entries because their major classid matches.

I might be missing something though.


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 18:33         ` Thomas Graf
@ 2004-11-08 19:46           ` Patrick McHardy
  2004-11-08 20:15             ` Thomas Graf
                               ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Patrick McHardy @ 2004-11-08 19:46 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Szymon Miotk, netdev

[-- Attachment #1: Type: text/plain, Size: 826 bytes --]

Thomas Graf wrote:

>* Patrick McHardy <418F9AD0.1040701@trash.net> 2004-11-08 17:12
>  
>
>>There is some optimization possible, I will do this for the final
>>patch. But I don't understand the problem you refer to, can you
>>please explain ?
>>
>
>I don't have the time to verify this at the moment but:
>
>1) qdisc_destroy unlinking all the lists
>2) RTM_NEWQDISC creating a new qdisc with the same major classid as the old one
>   which will succeed since the old one cannot be found anymore.
>3) rcu callback __qdisc_destroy -> qdisc_destroy looking through qdisc_list
>   again and then deleting the new entries because their major classid matches.
>
>I might be missing something though.
>
You're right, thanks. The optimization already fixed this, but I
wasn't aware of the bug :) New patch attached.

Regards
Patrick


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 1512 bytes --]

===== net/sched/sch_api.c 1.42 vs edited =====
--- 1.42/net/sched/sch_api.c	2004-11-07 05:03:04 +01:00
+++ edited/net/sched/sch_api.c	2004-11-08 02:38:04 +01:00
@@ -196,10 +196,14 @@
 {
 	struct Qdisc *q;
 
+	read_lock_bh(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
-		if (q->handle == handle)
+		if (q->handle == handle) {
+			read_unlock_bh(&qdisc_tree_lock);
 			return q;
+		}
 	}
+	read_unlock_bh(&qdisc_tree_lock);
 	return NULL;
 }
 
===== net/sched/sch_generic.c 1.30 vs edited =====
--- 1.30/net/sched/sch_generic.c	2004-11-06 01:34:45 +01:00
+++ edited/net/sched/sch_generic.c	2004-11-08 20:41:58 +01:00
@@ -483,10 +483,32 @@
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	struct list_head cql = LIST_HEAD_INIT(cql);
+	struct Qdisc *cq, *q, *n;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
-	list_del(&qdisc->list);
+
+	if (!list_empty(&qdisc->list)) {
+		if (qdisc->ops->cl_ops == NULL)
+			list_del(&qdisc->list);
+		else
+			list_move(&qdisc->list, &cql);
+	}
+
+	/* unlink inner qdiscs from dev->qdisc_list immediately */
+	list_for_each_entry(cq, &cql, list)
+		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
+			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
+				if (q->ops->cl_ops != NULL)
+					list_move_tail(&q->list, &cql);
+				else
+					list_del_init(&q->list);
+			}
+	list_for_each_entry_safe(cq, n, &cql, list)
+		list_del_init(&cq->list);
+
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 19:46           ` Patrick McHardy
@ 2004-11-08 20:15             ` Thomas Graf
  2004-11-10  0:18             ` David S. Miller
  2004-11-10 12:08             ` Szymon Miotk
  2 siblings, 0 replies; 34+ messages in thread
From: Thomas Graf @ 2004-11-08 20:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Szymon Miotk, netdev

* Patrick McHardy <418FCD0A.4040202@trash.net> 2004-11-08 20:46
> You're right, thanks. The optimization already fixed this, but I
> wasn't aware of the bug :) New patch attached.

Looks perfectly fine.


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 19:46           ` Patrick McHardy
  2004-11-08 20:15             ` Thomas Graf
@ 2004-11-10  0:18             ` David S. Miller
  2004-11-10  0:40               ` Patrick McHardy
  2004-11-10 12:08             ` Szymon Miotk
  2 siblings, 1 reply; 34+ messages in thread
From: David S. Miller @ 2004-11-10  0:18 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: tgraf, spam, netdev

On Mon, 08 Nov 2004 20:46:18 +0100
Patrick McHardy <kaber@trash.net> wrote:

> New patch attached.

How do these child qdiscs get destroyed at all if you just
remove them from the lists they are on?  How will the rest
of destroy processing find them and clean them up?


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-10  0:18             ` David S. Miller
@ 2004-11-10  0:40               ` Patrick McHardy
  2004-11-10  0:55                 ` Patrick McHardy
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-10  0:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: tgraf, spam, netdev

David S. Miller wrote:

>On Mon, 08 Nov 2004 20:46:18 +0100
>Patrick McHardy <kaber@trash.net> wrote:
>
>  
>
>>New patch attached.
>>    
>>
>
>How do these child qdiscs get destroyed at all if you just
>remove them from the lists they are on?  How will the rest
>of destroy processing find them and clean them up?
>
>  
>
The RCU callback calls ops->destroy. The qdisc knows about its inner
structure and destroys all classes and the inner qdiscs. dev->qdisc_list
is just a flat list containing all qdiscs of the tree for lookups.

Regards
Patrick


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-10  0:40               ` Patrick McHardy
@ 2004-11-10  0:55                 ` Patrick McHardy
  2004-11-10  6:13                   ` David S. Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick McHardy @ 2004-11-10  0:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: tgraf, spam, netdev

[-- Attachment #1: Type: text/plain, Size: 1658 bytes --]

Patrick McHardy wrote:

> David S. Miller wrote:
>
>> How do these child qdiscs get destroyed at all if you just
>> remove them from the lists they are on?  How will the rest
>> of destroy processing find them and clean them up?
>>
>>  
>
> The RCU callback calls ops->destroy. The qdisc knows about its inner
> structure and destroys all classes and the inner qdiscs. dev->qdisc_list
> is just a flat list containing all qdiscs of the tree for lookups.


This is the final patch:

#   [PKT_SCHED]: Unlink inner qdiscs immediately in qdisc_destroy
#
#   Before the RCU change, destruction of the qdisc and all inner
#   qdiscs happened immediately and under the rtnl semaphore. This
#   made sure nothing holding the rtnl semaphore could end up with
#   invalid memory. This is not true anymore; inner qdiscs found on
#   dev->qdisc_list can suddenly be destroyed by the RCU callback.
#   Unlink them immediately when the outer qdisc is destroyed so
#   nothing can find them until they get destroyed.

This also makes the semantics sane again: an inner qdisc should not
be user-visible once the containing qdisc has been destroyed. The
second part (locking in qdisc_lookup) is not really required, but
currently the only purpose of qdisc_tree_lock seems to be to protect
dev->qdisc_list, which is also protected by the rtnl. The rtnl is
especially relied on for making sure nobody frees a qdisc while it
is used in user context, so qdisc_tree_lock looks unnecessary. I'm
currently reviewing all qdisc locking; if this turns out to be right
I will remove qdisc_tree_lock entirely in a follow-up patch, but for
now I left it in for consistency.

Regards
Patrick


[-- Attachment #2: x --]
[-- Type: text/plain, Size: 3694 bytes --]

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/11/09 07:46:48+01:00 kaber@coreworks.de 
#   [PKT_SCHED]: Unlink inner qdiscs immediately in qdisc_destroy
#   
#   Before the RCU change, destruction of the qdisc and all inner
#   qdiscs happened immediately and under the rtnl semaphore. This
#   made sure nothing holding the rtnl semaphore could end up with
#   invalid memory. This is not true anymore; inner qdiscs found on
#   dev->qdisc_list can suddenly be destroyed by the RCU callback.
#   Unlink them immediately when the outer qdisc is destroyed so
#   nothing can find them until they get destroyed.
#   
#   With help from Thomas Graf <tgraf@suug.ch>
#   
#   Signed-off-by: Patrick McHardy <kaber@trash.net>
# 
# net/sched/sch_generic.c
#   2004/11/09 07:46:39+01:00 kaber@coreworks.de +23 -1
#   [PKT_SCHED]: Unlink inner qdiscs immediately in qdisc_destroy
#   
#   Before the RCU change, destruction of the qdisc and all inner
#   qdiscs happened immediately and under the rtnl semaphore. This
#   made sure nothing holding the rtnl semaphore could end up with
#   invalid memory. This is not true anymore; inner qdiscs found on
#   dev->qdisc_list can suddenly be destroyed by the RCU callback.
#   Unlink them immediately when the outer qdisc is destroyed so
#   nothing can find them until they get destroyed.
#   
#   With help from Thomas Graf <tgraf@suug.ch>
#   
#   Signed-off-by: Patrick McHardy <kaber@trash.net>
# 
# net/sched/sch_api.c
#   2004/11/09 07:46:39+01:00 kaber@coreworks.de +5 -1
#   [PKT_SCHED]: Unlink inner qdiscs immediately in qdisc_destroy
#   
#   Before the RCU change, destruction of the qdisc and all inner
#   qdiscs happened immediately and under the rtnl semaphore. This
#   made sure nothing holding the rtnl semaphore could end up with
#   invalid memory. This is not true anymore; inner qdiscs found on
#   dev->qdisc_list can suddenly be destroyed by the RCU callback.
#   Unlink them immediately when the outer qdisc is destroyed so
#   nothing can find them until they get destroyed.
#   
#   With help from Thomas Graf <tgraf@suug.ch>
#   
#   Signed-off-by: Patrick McHardy <kaber@trash.net>
# 
diff -Nru a/net/sched/sch_api.c b/net/sched/sch_api.c
--- a/net/sched/sch_api.c	2004-11-10 01:45:31 +01:00
+++ b/net/sched/sch_api.c	2004-11-10 01:45:31 +01:00
@@ -196,10 +196,14 @@
 {
 	struct Qdisc *q;
 
+	read_lock_bh(&qdisc_tree_lock);
 	list_for_each_entry(q, &dev->qdisc_list, list) {
-		if (q->handle == handle)
+		if (q->handle == handle) {
+			read_unlock_bh(&qdisc_tree_lock);
 			return q;
+		}
 	}
+	read_unlock_bh(&qdisc_tree_lock);
 	return NULL;
 }
 
diff -Nru a/net/sched/sch_generic.c b/net/sched/sch_generic.c
--- a/net/sched/sch_generic.c	2004-11-10 01:45:31 +01:00
+++ b/net/sched/sch_generic.c	2004-11-10 01:45:31 +01:00
@@ -483,10 +483,32 @@
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	struct list_head cql = LIST_HEAD_INIT(cql);
+	struct Qdisc *cq, *q, *n;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 		!atomic_dec_and_test(&qdisc->refcnt))
 		return;
-	list_del(&qdisc->list);
+
+	if (!list_empty(&qdisc->list)) {
+		if (qdisc->ops->cl_ops == NULL)
+			list_del(&qdisc->list);
+		else
+			list_move(&qdisc->list, &cql);
+	}
+
+	/* unlink inner qdiscs from dev->qdisc_list immediately */
+	list_for_each_entry(cq, &cql, list)
+		list_for_each_entry_safe(q, n, &qdisc->dev->qdisc_list, list)
+			if (TC_H_MAJ(q->parent) == TC_H_MAJ(cq->handle)) {
+				if (q->ops->cl_ops == NULL)
+					list_del_init(&q->list);
+				else
+					list_move_tail(&q->list, &cql);
+			}
+	list_for_each_entry_safe(cq, n, &cql, list)
+		list_del_init(&cq->list);
+
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-10  0:55                 ` Patrick McHardy
@ 2004-11-10  6:13                   ` David S. Miller
  0 siblings, 0 replies; 34+ messages in thread
From: David S. Miller @ 2004-11-10  6:13 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: tgraf, spam, netdev

On Wed, 10 Nov 2004 01:55:37 +0100
Patrick McHardy <kaber@trash.net> wrote:

> This is the final patch:
 ...
> #   [PKT_SCHED]: Unlink inner qdiscs immediately in qdisc_destroy

I understand how this works now.  Thanks for the
explanation and the updated patch.

Applied.


* Re: PROBLEM: IProute hangs after running traffic shaping scripts
  2004-11-08 19:46           ` Patrick McHardy
  2004-11-08 20:15             ` Thomas Graf
  2004-11-10  0:18             ` David S. Miller
@ 2004-11-10 12:08             ` Szymon Miotk
  2 siblings, 0 replies; 34+ messages in thread
From: Szymon Miotk @ 2004-11-10 12:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, netdev, davem

Patrick McHardy wrote:
> Thomas Graf wrote:
> 
>> * Patrick McHardy <418F9AD0.1040701@trash.net> 2004-11-08 17:12
>>  
>>
>>> There is some optimization possible, I will do this for the final
>>> patch. But I don't understand the problem you refer to, can you
>>> please explain ?
>>>
>>
>> I don't have the time to verify this at the moment but:
>>
>> 1) qdisc_destroy unlinking all the lists
>> 2) RTM_NEWQDISC creating a new qdisc with the same major classid as 
>> the old one
>>   which will succeed since the old one cannot be found anymore.
>> 3) rcu callback __qdisc_destroy -> qdisc_destroy looking through 
>> qdisc_list
>>   again and then deleting the new entries because their major classid 
>> matches.
>>
>> I might be missing something though.
>>
> You're right, thanks. The optimization already fixed this, but I
> wasn't aware of the bug :) New patch attached.

I've tested the new kernel (2.6.10-rc1_bk17) with this patch.
It performed well on the test system (reloading my traffic shaping 
configuration 100 times).
It has now been running for some 22 hours on the production system, 
under normal load (~20 Mbit, 500 clients at a time, 15000 entries in 
the routing table), and it performs well.

Thanks a lot for your help, it really saved me from a lot of trouble.

Szymon Miotk


end of thread, other threads:[~2004-11-10 12:08 UTC | newest]

Thread overview: 34+ messages
2004-11-05  9:48 PROBLEM: IProute hangs after running traffic shaping scripts Szymon Miotk
2004-11-05 11:54 ` Thomas Graf
2004-11-05 14:16   ` [PATCH] PKT_SCHED: Initialize list field in dummy qdiscs Thomas Graf
2004-11-05 16:12     ` Patrick McHardy
2004-11-05 16:39       ` Thomas Graf
2004-11-05 17:26         ` Patrick McHardy
2004-11-05 17:58           ` Thomas Graf
2004-11-05 18:18             ` Patrick McHardy
2004-11-05 19:43               ` Thomas Graf
2004-11-06  1:18                 ` Thomas Graf
2004-11-06  1:47                   ` Patrick McHardy
2004-11-06  1:59                     ` Thomas Graf
2004-11-06 14:50                       ` Thomas Graf
2004-11-07  8:57                         ` Patrick McHardy
2004-11-07 14:00                           ` Thomas Graf
2004-11-07 16:19                             ` Patrick McHardy
2004-11-07 16:33                               ` Thomas Graf
2004-11-07 17:02                                 ` Patrick McHardy
2004-11-07 17:49                                   ` Thomas Graf
2004-11-07 18:22                                     ` Patrick McHardy
2004-11-07 19:08                                       ` Thomas Graf
2004-11-06  0:36             ` David S. Miller
2004-11-07 22:22 ` PROBLEM: IProute hangs after running traffic shaping scripts Patrick McHardy
2004-11-08  1:40   ` Patrick McHardy
2004-11-08 13:54     ` Thomas Graf
2004-11-08 16:12       ` Patrick McHardy
2004-11-08 18:33         ` Thomas Graf
2004-11-08 19:46           ` Patrick McHardy
2004-11-08 20:15             ` Thomas Graf
2004-11-10  0:18             ` David S. Miller
2004-11-10  0:40               ` Patrick McHardy
2004-11-10  0:55                 ` Patrick McHardy
2004-11-10  6:13                   ` David S. Miller
2004-11-10 12:08             ` Szymon Miotk
