* iproute2: tc deletion freezes whole server
@ 2020-03-22 18:06 Václav Zindulka
  2020-03-23 18:17 ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-22 18:06 UTC (permalink / raw)
  To: netdev

Hello,

I've discovered a problem with tc deletion when rules are defined on
the egress of a physical interface (dual port SFP+ card). I'm working
on a shaping solution for a medium-sized ISP. I was moving from a tc
filter setup to nftables classification and abandoned the long-used
ifb interfaces. The problem is that sometimes, during deletion of tc
class or tc qdisc rules, the whole server freezes for some time,
mostly above 20s, though not every server behaves this way. It causes
BGP peers to fail with a "Hold timer expired" error. I'm unable to
access the server even via the IPMI console during that time. It took
me a lot of time to discover the cause of the problem, and I tried to
move back to ifb. But it was a dead end, since ifb precedes netfilter
in the packet path and there is no way to classify packets and
redirect them to ifb for shaping. I'm in touch with an nftables
developer, but so far we have had no luck doing packet classification
in ifb ingress.

Recently I discovered the perf tool and found that the delay occurs
between tc and the kernel. Below is a perf trace of tc qdisc del
dev enp1s0f0 root. Notice the 11s delay at the end. (A similar problem
occurs during deletion of individual tc classes and qdiscs.) This was
done on a fresh Debian Stretch installation, but mostly the delay is
much greater. The problem is not limited to Debian Stretch: it happens
with Debian Buster and even with Ubuntu 19.10, and on kernels 4.9,
4.19, 5.1.2, 5.2.3 and 5.4.6, as far as I've tested. It is not caused
by one manufacturer or device driver. We mostly use dual port Intel
82599ES cards made by SuperMicro, and I've tried the in-kernel drivers
as well as the latest ixgbe driver. Exactly the same problem occurs
with dual port Mellanox ConnectX-4 LX and Myricom 10G-PCIE-8B cards.
The whole network adapter resets after the deletion of rules.

perf trace tc qdisc del dev enp1s0f0 root
           ? (     ?   ): tc/9491  ... [continued]: execve()) = 0
     0.028 ( 0.002 ms): tc/9491 brk() = 0x559328a6e000
     0.049 ( 0.004 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.056 ( 0.002 ms): tc/9491 access(filename: 0x59eeec70, mode: R) = -1 ENOENT No such file or directory
     0.065 ( 0.004 ms): tc/9491 open(filename: 0x59eec988, flags: CLOEXEC) = 3
     0.071 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f8b0) = 0
     0.074 ( 0.004 ms): tc/9491 mmap(len: 40805, prot: READ, flags: PRIVATE, fd: 3) = 0x7f425a0e8000
     0.080 ( 0.001 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.086 ( 0.002 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.092 ( 0.004 ms): tc/9491 open(filename: 0x5a0f4d68, flags: CLOEXEC) = 3
     0.097 ( 0.003 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x7ffe5cf5fa58, count: 832) = 832
     0.102 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f8f0) = 0
     0.106 ( 0.003 ms): tc/9491 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS, fd: -1) = 0x7f425a0e6000
     0.113 ( 0.004 ms): tc/9491 mmap(len: 2191816, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f4259cb7000
     0.119 ( 0.006 ms): tc/9491 mprotect(start: 0x7f4259cce000, len: 2093056) = 0
     0.126 ( 0.005 ms): tc/9491 mmap(addr: 0x7f4259ecd000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 90112) = 0x7f4259ecd000
     0.141 ( 0.002 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.149 ( 0.002 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.154 ( 0.004 ms): tc/9491 open(filename: 0x5a0e64c8, flags: CLOEXEC) = 3
     0.160 ( 0.002 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x7ffe5cf5fa28, count: 832) = 832
     0.164 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f8c0) = 0
     0.168 ( 0.004 ms): tc/9491 mmap(len: 3158248, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f42599b3000
     0.174 ( 0.005 ms): tc/9491 mprotect(start: 0x7f4259ab6000, len: 2093056) = 0
     0.181 ( 0.004 ms): tc/9491 mmap(addr: 0x7f4259cb5000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 1056768) = 0x7f4259cb5000
     0.194 ( 0.001 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.200 ( 0.002 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.205 ( 0.003 ms): tc/9491 open(filename: 0x5a0e69a8, flags: CLOEXEC) = 3
     0.210 ( 0.002 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x7ffe5cf5f9f8, count: 832) = 832
     0.213 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f890) = 0
     0.217 ( 0.004 ms): tc/9491 mmap(len: 2109680, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f42597af000
     0.223 ( 0.006 ms): tc/9491 mprotect(start: 0x7f42597b2000, len: 2093056) = 0
     0.230 ( 0.004 ms): tc/9491 mmap(addr: 0x7f42599b1000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 8192) = 0x7f42599b1000
     0.243 ( 0.001 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.248 ( 0.002 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.253 ( 0.004 ms): tc/9491 open(filename: 0x5a0e6e98, flags: CLOEXEC) = 3
     0.259 ( 0.002 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x7ffe5cf5f9c8, count: 832) = 832
     0.263 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f860) = 0
     0.268 ( 0.004 ms): tc/9491 mmap(len: 3795296, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f4259410000
     0.274 ( 0.006 ms): tc/9491 mprotect(start: 0x7f42595a5000, len: 2097152) = 0
     0.282 ( 0.005 ms): tc/9491 mmap(addr: 0x7f42597a5000, len: 24576, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 1658880) = 0x7f42597a5000
     0.292 ( 0.003 ms): tc/9491 mmap(addr: 0x7f42597ab000, len: 14688, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS|FIXED, fd: -1) = 0x7f42597ab000
     0.302 ( 0.001 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.312 ( 0.002 ms): tc/9491 access(filename: 0x59eec4e7) = -1 ENOENT No such file or directory
     0.317 ( 0.003 ms): tc/9491 open(filename: 0x5a0e7378, flags: CLOEXEC) = 3
     0.322 ( 0.002 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x7ffe5cf5f878, count: 832) = 832
     0.326 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f710) = 0
     0.329 ( 0.004 ms): tc/9491 mmap(len: 2200072, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7f42591f6000
     0.335 ( 0.004 ms): tc/9491 mprotect(start: 0x7f425920f000, len: 2093056) = 0
     0.341 ( 0.004 ms): tc/9491 mmap(addr: 0x7f425940e000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 98304) = 0x7f425940e000
     0.354 ( 0.001 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.376 ( 0.003 ms): tc/9491 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS, fd: -1) = 0x7f425a0e4000
     0.387 ( 0.002 ms): tc/9491 arch_prctl(option: 4098, arg2: 139922955456576, arg3: 139922955458896, arg4: 139922955453744, arg5: 120) = 0
     0.453 ( 0.006 ms): tc/9491 mprotect(start: 0x7f42597a5000, len: 16384, prot: READ) = 0
     0.463 ( 0.004 ms): tc/9491 mprotect(start: 0x7f425940e000, len: 4096, prot: READ) = 0
     0.473 ( 0.004 ms): tc/9491 mprotect(start: 0x7f42599b1000, len: 4096, prot: READ) = 0
     0.491 ( 0.004 ms): tc/9491 mprotect(start: 0x7f4259cb5000, len: 4096, prot: READ) = 0
     0.499 ( 0.003 ms): tc/9491 mprotect(start: 0x7f4259ecd000, len: 4096, prot: READ) = 0
     0.526 ( 0.004 ms): tc/9491 mprotect(start: 0x559326baa000, len: 4096, prot: READ) = 0
     0.534 ( 0.004 ms): tc/9491 mprotect(start: 0x7f425a0f2000, len: 4096, prot: READ) = 0
     0.540 ( 0.010 ms): tc/9491 munmap(addr: 0x7f425a0e8000, len: 40805) = 0
     0.634 ( 0.002 ms): tc/9491 brk() = 0x559328a6e000
     0.638 ( 0.003 ms): tc/9491 brk(brk: 0x559328a8f000) = 0x559328a8f000
     0.650 ( 0.024 ms): tc/9491 open(filename: 0x269a18d1, mode: IRUGO|IWUGO) = 3
     0.681 ( 0.002 ms): tc/9491 fstat(fd: 3<socket:[151207]>, statbuf: 0x7ffe5cf5f9a0) = 0
     0.685 ( 0.004 ms): tc/9491 read(fd: 3<socket:[151207]>, buf: 0x559328a6e240, count: 1024) = 36
     0.696 ( 0.002 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
     0.705 ( 0.007 ms): tc/9491 socket(family: NETLINK, type: RAW|CLOEXEC) = 3
     0.715 ( 0.002 ms): tc/9491 setsockopt(fd: 3<socket:[151207]>, level: 1, optname: 7, optval: 0x7ffe5cf60294, optlen: 4) = 0
     0.719 ( 0.002 ms): tc/9491 setsockopt(fd: 3<socket:[151207]>, level: 1, optname: 8, optval: 0x559326bac8c8, optlen: 4) = 0
     0.723 ( 0.003 ms): tc/9491 bind(fd: 3<socket:[151207]>, umyaddr: 0x559326bb8004, addrlen: 12) = 0
     0.728 ( 0.002 ms): tc/9491 getsockname(fd: 3<socket:[151207]>, usockaddr: 0x559326bb8004, usockaddr_len: 0x7ffe5cf60290) = 0
     0.781 ( 0.033 ms): tc/9491 sendto(fd: 3<socket:[151207]>, buff: 0x7ffe5cf50180, len: 40) = 40
     0.819 ( 0.031 ms): tc/9491 recvmsg(fd: 3<socket:[151207]>, msg: 0x7ffe5cf480e0) = 3752
     0.864 ( 0.005 ms): tc/9491 recvmsg(fd: 3<socket:[151207]>, msg: 0x7ffe5cf480e0) = 5036
     0.872 ( 0.002 ms): tc/9491 recvmsg(fd: 3<socket:[151207]>, msg: 0x7ffe5cf480e0) = 20
     0.888 (11260.106 ms): tc/9491 sendmsg(fd: 3<socket:[151207]>, msg: 0x7ffe5cf48140) = 36
 11261.021 ( 0.007 ms): tc/9491 recvmsg(fd: 3<socket:[151207]>, msg: 0x7ffe5cf48140) = 36
 11261.037 ( 0.003 ms): tc/9491 close(fd: 3<socket:[151207]>) = 0
 11261.100 (     ?   ): tc/9491 exit_group()

When I call this command on an ifb interface or an RJ45 interface,
everything is done within one second.

perf trace tc qdisc del dev ifb0 root
           ? (     ?   ): tc/9400  ... [continued]: execve()) = 0
     0.017 ( 0.001 ms): tc/9400 brk() = 0x5569cb9ca000
     0.030 ( 0.003 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.035 ( 0.001 ms): tc/9400 access(filename: 0xe2fbec70, mode: R) = -1 ENOENT No such file or directory
     0.041 ( 0.002 ms): tc/9400 open(filename: 0xe2fbc988, flags: CLOEXEC) = 3
     0.044 ( 0.002 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870d50) = 0
     0.047 ( 0.002 ms): tc/9400 mmap(len: 40805, prot: READ, flags: PRIVATE, fd: 3) = 0x7fa6e31b8000
     0.050 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.054 ( 0.001 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.057 ( 0.003 ms): tc/9400 open(filename: 0xe31c4d68, flags: CLOEXEC) = 3
     0.060 ( 0.002 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x7fff37870ef8, count: 832) = 832
     0.064 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870d90) = 0
     0.065 ( 0.002 ms): tc/9400 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS, fd: -1) = 0x7fa6e31b6000
     0.070 ( 0.002 ms): tc/9400 mmap(len: 2191816, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa6e2d87000
     0.073 ( 0.004 ms): tc/9400 mprotect(start: 0x7fa6e2d9e000, len: 2093056) = 0
     0.078 ( 0.003 ms): tc/9400 mmap(addr: 0x7fa6e2f9d000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 90112) = 0x7fa6e2f9d000
     0.087 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.091 ( 0.001 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.094 ( 0.002 ms): tc/9400 open(filename: 0xe31b64c8, flags: CLOEXEC) = 3
     0.097 ( 0.001 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x7fff37870ec8, count: 832) = 832
     0.100 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870d60) = 0
     0.102 ( 0.003 ms): tc/9400 mmap(len: 3158248, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa6e2a83000
     0.106 ( 0.004 ms): tc/9400 mprotect(start: 0x7fa6e2b86000, len: 2093056) = 0
     0.110 ( 0.003 ms): tc/9400 mmap(addr: 0x7fa6e2d85000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 1056768) = 0x7fa6e2d85000
     0.118 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.122 ( 0.001 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.124 ( 0.002 ms): tc/9400 open(filename: 0xe31b69a8, flags: CLOEXEC) = 3
     0.127 ( 0.001 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x7fff37870e98, count: 832) = 832
     0.130 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870d30) = 0
     0.132 ( 0.003 ms): tc/9400 mmap(len: 2109680, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa6e287f000
     0.135 ( 0.003 ms): tc/9400 mprotect(start: 0x7fa6e2882000, len: 2093056) = 0
     0.139 ( 0.003 ms): tc/9400 mmap(addr: 0x7fa6e2a81000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 8192) = 0x7fa6e2a81000
     0.147 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.150 ( 0.001 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.153 ( 0.002 ms): tc/9400 open(filename: 0xe31b6e98, flags: CLOEXEC) = 3
     0.156 ( 0.001 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x7fff37870e68, count: 832) = 832
     0.158 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870d00) = 0
     0.161 ( 0.003 ms): tc/9400 mmap(len: 3795296, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa6e24e0000
     0.165 ( 0.004 ms): tc/9400 mprotect(start: 0x7fa6e2675000, len: 2097152) = 0
     0.169 ( 0.003 ms): tc/9400 mmap(addr: 0x7fa6e2875000, len: 24576, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 1658880) = 0x7fa6e2875000
     0.175 ( 0.002 ms): tc/9400 mmap(addr: 0x7fa6e287b000, len: 14688, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS|FIXED, fd: -1) = 0x7fa6e287b000
     0.182 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.187 ( 0.001 ms): tc/9400 access(filename: 0xe2fbc4e7) = -1 ENOENT No such file or directory
     0.190 ( 0.002 ms): tc/9400 open(filename: 0xe31b7378, flags: CLOEXEC) = 3
     0.193 ( 0.001 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x7fff37870d18, count: 832) = 832
     0.195 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870bb0) = 0
     0.197 ( 0.003 ms): tc/9400 mmap(len: 2200072, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7fa6e22c6000
     0.201 ( 0.003 ms): tc/9400 mprotect(start: 0x7fa6e22df000, len: 2093056) = 0
     0.205 ( 0.002 ms): tc/9400 mmap(addr: 0x7fa6e24de000, len: 8192, prot: READ|WRITE, flags: PRIVATE|DENYWRITE|FIXED, fd: 3, off: 98304) = 0x7fa6e24de000
     0.213 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.226 ( 0.002 ms): tc/9400 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS, fd: -1) = 0x7fa6e31b4000
     0.232 ( 0.001 ms): tc/9400 arch_prctl(option: 4098, arg2: 140354751516736, arg3: 140354751519056, arg4: 140354751513904, arg5: 120) = 0
     0.273 ( 0.004 ms): tc/9400 mprotect(start: 0x7fa6e2875000, len: 16384, prot: READ) = 0
     0.279 ( 0.002 ms): tc/9400 mprotect(start: 0x7fa6e24de000, len: 4096, prot: READ) = 0
     0.285 ( 0.002 ms): tc/9400 mprotect(start: 0x7fa6e2a81000, len: 4096, prot: READ) = 0
     0.295 ( 0.002 ms): tc/9400 mprotect(start: 0x7fa6e2d85000, len: 4096, prot: READ) = 0
     0.299 ( 0.002 ms): tc/9400 mprotect(start: 0x7fa6e2f9d000, len: 4096, prot: READ) = 0
     0.316 ( 0.002 ms): tc/9400 mprotect(start: 0x5569cb616000, len: 4096, prot: READ) = 0
     0.321 ( 0.002 ms): tc/9400 mprotect(start: 0x7fa6e31c2000, len: 4096, prot: READ) = 0
     0.324 ( 0.006 ms): tc/9400 munmap(addr: 0x7fa6e31b8000, len: 40805) = 0
     0.379 ( 0.001 ms): tc/9400 brk() = 0x5569cb9ca000
     0.381 ( 0.002 ms): tc/9400 brk(brk: 0x5569cb9eb000) = 0x5569cb9eb000
     0.387 ( 0.013 ms): tc/9400 open(filename: 0xcb40d8d1, mode: IRUGO|IWUGO) = 3
     0.404 ( 0.001 ms): tc/9400 fstat(fd: 3<socket:[148457]>, statbuf: 0x7fff37870e40) = 0
     0.407 ( 0.002 ms): tc/9400 read(fd: 3<socket:[148457]>, buf: 0x5569cb9ca240, count: 1024) = 36
     0.413 ( 0.001 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
     0.419 ( 0.005 ms): tc/9400 socket(family: NETLINK, type: RAW|CLOEXEC) = 3
     0.426 ( 0.002 ms): tc/9400 setsockopt(fd: 3<socket:[148457]>, level: 1, optname: 7, optval: 0x7fff37871734, optlen: 4) = 0
     0.429 ( 0.001 ms): tc/9400 setsockopt(fd: 3<socket:[148457]>, level: 1, optname: 8, optval: 0x5569cb6188c8, optlen: 4) = 0
     0.431 ( 0.002 ms): tc/9400 bind(fd: 3<socket:[148457]>, umyaddr: 0x5569cb624004, addrlen: 12) = 0
     0.434 ( 0.001 ms): tc/9400 getsockname(fd: 3<socket:[148457]>, usockaddr: 0x5569cb624004, usockaddr_len: 0x7fff37871730) = 0
     0.466 ( 0.024 ms): tc/9400 sendto(fd: 3<socket:[148457]>, buff: 0x7fff37861620, len: 40) = 40
     0.493 ( 0.020 ms): tc/9400 recvmsg(fd: 3<socket:[148457]>, msg: 0x7fff37859580) = 3752
     0.521 ( 0.002 ms): tc/9400 recvmsg(fd: 3<socket:[148457]>, msg: 0x7fff37859580) = 5032
     0.525 ( 0.001 ms): tc/9400 recvmsg(fd: 3<socket:[148457]>, msg: 0x7fff37859580) = 20
     0.535 (516.200 ms): tc/9400 sendmsg(fd: 3<socket:[148457]>, msg: 0x7fff378595e0) = 36
   516.744 ( 0.008 ms): tc/9400 recvmsg(fd: 3<socket:[148457]>, msg: 0x7fff378595e0) = 36
   516.761 ( 0.002 ms): tc/9400 close(fd: 3<socket:[148457]>) = 0
   516.823 (     ?   ): tc/9400 exit_group()

My testing setup consists of approx. 18k tc class rules and approx.
13k tc qdisc rules, and it was altered only in the interface name.
Everything works OK with ifb interfaces and with metallic (copper)
interfaces. I don't know how to diagnose the problem further. It will
most likely work with regular network cards; all the problems begin
with SFP+ interfaces. I do a lot of dynamic operations, modifying the
shaping tree according to the real situation and changes in the
network, so I delete tc rules regularly. It is a matter of hours or
days before the whole server freezes due to the tc deletion problem.
I have reproducer batches for tc ready if anybody is willing to have
a look at this issue. I can offer one server, which has this problem
every time, to debug and test on. Or at least I would appreciate some
advice on how to diagnose the tc deletion process further.

Thank you
----
S pozdravem / Best Regards

Vaclav Zindulka


* Re: iproute2: tc deletion freezes whole server
  2020-03-22 18:06 iproute2: tc deletion freezes whole server Václav Zindulka
@ 2020-03-23 18:17 ` Cong Wang
  2020-03-24 16:27   ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-23 18:17 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

Hello,

On Sun, Mar 22, 2020 at 11:07 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> Hello,
...

> Recently I discovered the perf tool and found that the delay occurs
> between tc and the kernel. Below is a perf trace of tc qdisc del
> dev enp1s0f0 root. Notice the 11s delay at the end. (A similar problem
> occurs during deletion of individual tc classes and qdiscs.) This was
> done on a fresh Debian Stretch installation, but mostly the delay is
> much greater. The problem is not limited to Debian Stretch: it happens
> with Debian Buster and even with Ubuntu 19.10, and on kernels 4.9,
> 4.19, 5.1.2, 5.2.3 and 5.4.6, as far as I've tested. It is not caused
> by one manufacturer or device driver. We mostly use dual port Intel
> 82599ES cards made by SuperMicro, and I've tried the in-kernel drivers
> as well as the latest ixgbe driver. Exactly the same problem occurs
> with dual port Mellanox ConnectX-4 LX and Myricom 10G-PCIE-8B cards.
> The whole network adapter resets after the deletion of rules.
>
> perf trace tc qdisc del dev enp1s0f0 root

Can you capture a `perf record` for kernel functions too? We
need to know where the kernel spent its time during this 11s delay.

>
> When I call this command on an ifb interface or an RJ45 interface,
> everything is done within one second.


Do they have the same tc configuration and same workload?


> My testing setup consists of approx. 18k tc class rules and approx.
> 13k tc qdisc rules, and it was altered only in the interface name.
> Everything works OK with ifb interfaces and with metallic (copper)
> interfaces. I don't know how to diagnose the problem further. It will
> most likely work with regular network cards; all the problems begin
> with SFP+ interfaces. I do a lot of dynamic operations, modifying the
> shaping tree according to the real situation and changes in the
> network, so I delete tc rules regularly. It is a matter of hours or
> days before the whole server freezes due to the tc deletion problem.
> I have reproducer batches for tc ready if anybody is willing to have
> a look at this issue. I can offer one server, which has this problem
> every time, to debug and test on. Or at least I would appreciate some
> advice on how to diagnose the tc deletion process further.

Please share your tc configurations (tc -s -d qd show dev ..., tc
-s -d filter show dev ...).

Also, it would be great if you can provide a minimal reproducer.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-03-23 18:17 ` Cong Wang
@ 2020-03-24 16:27   ` Václav Zindulka
  2020-03-24 22:57     ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-24 16:27 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

Hello,

Thank you for the reply!

On Mon, Mar 23, 2020 at 7:18 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > perf trace tc qdisc del dev enp1s0f0 root
>
> Can you capture a `perf record` for kernel functions too? We
> need to know where the kernel spent its time during this 11s delay.

See perf.data.{ifb0, enp1s0f0} here
https://github.com/zvalcav/tc-kernel/tree/master/20200324. I hope it
is the output you wanted. If you need anything else, let me know.

> > When I call this command on an ifb interface or an RJ45 interface,
> > everything is done within one second.
>
>
> Do they have the same tc configuration and same workload?

Yes, both reproducers are exactly the same and the interfaces are
configured in a similar way. I have most of the offloading turned off
for the physical interfaces. Yet metallic interfaces don't cause that
big a delay while SFP+ dual port cards do, though not all of them. The
only difference between the reproducers is the interface name; see the
git repository above, tc-interface_name-upload/download.txt files. I
have altered my whole setup in the daemon I'm working on to change the
interfaces used. The only difference from the production setup is the
existence there of ingress tc filter rules that redirect traffic to
ifb interfaces. I don't use tc filter classification in the current
setup; I use nftables' ability to classify traffic. There is no
traffic on the interfaces except the ssh session. It behaves the same
way with and without traffic.

> > My testing setup consists of approx. 18k tc class rules and approx.
> > 13k tc qdisc rules and was altered only with different interface name....
>
> Please share your tc configurations (tc -s -d qd show dev ..., tc
> -s -d filter show dev ...).

I've placed the whole reproducers into the repository. Do you need exports of the rules too?

> Also, it would be great if you can provide a minimal reproducer.

I'm afraid that a minimal reproducer won't trigger the problem. This
was happening mostly on servers with large tc rule setups. I have
tried to create a small reproducer for the nftables developer many
times without success. I can try to create a reproducer as small as
possible, but it may still consist of thousands of rules.


* Re: iproute2: tc deletion freezes whole server
  2020-03-24 16:27   ` Václav Zindulka
@ 2020-03-24 22:57     ` Cong Wang
  2020-03-25 11:27       ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-24 22:57 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Tue, Mar 24, 2020 at 9:27 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> Hello,
>
> Thank you for the reply!
>
> On Mon, Mar 23, 2020 at 7:18 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > perf trace tc qdisc del dev enp1s0f0 root
> >
> > Can you capture a `perf record` for kernel functions too? We
> > need to know where the kernel spent its time during this 11s delay.
>
> See perf.data.{ifb0, enp1s0f0} here
> https://github.com/zvalcav/tc-kernel/tree/master/20200324. I hope it
> is the output you wanted. If you need anything else, let me know.

Hm, my bad, please also run `perf report -g` after you record them,
we need the text output with stack traces.

Also, do you have a complete kernel log too? If your network is completely
down, you need some serial console to capture it, kdump could also help
if you know how to setup it. The kernel log usually has some indication
for hangs, for example if we have too much to do with RTNL lock held,
kernel would complain some other tasks hung on waiting for RTNL.
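
(For context: the whole tc request is served under the global RTNL
mutex, so one slow deletion stalls every other RTNL user on the box.
A simplified sketch of the path, not the literal kernel code:)

	rtnl_lock();                  /* one global mutex for all of rtnetlink */
	err = tc_get_qdisc(skb, nlh); /* RTM_DELQDISC -> qdisc_graft() -> teardown */
	rtnl_unlock();                /* ip/tc/routing daemons block until here */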

>
> > > When I call this command on an ifb interface or an RJ45 interface,
> > > everything is done within one second.
> >
> >
> > Do they have the same tc configuration and same workload?
>
> Yes, both reproducers are exactly the same and the interfaces are
> configured in a similar way. I have most of the offloading turned off
> for the physical interfaces. Yet metallic interfaces don't cause that
> big a delay while SFP+ dual port cards do, though not all of them. The
> only difference between the reproducers is the interface name; see the
> git repository above, tc-interface_name-upload/download.txt files. I
> have altered my whole setup in the daemon I'm working on to change the
> interfaces used. The only difference from the production setup is the
> existence there of ingress tc filter rules that redirect traffic to
> ifb interfaces. I don't use tc filter classification in the current
> setup; I use nftables' ability to classify traffic. There is no
> traffic on the interfaces except the ssh session. It behaves the same
> way with and without traffic.

This rules out the slow path vs. fast path scenario. So the problem here
is probably that there are just too many TC classes and filters to destroy.

Does this also mean it is 100% reproducible when you have the same
number of classes and filters?


>
> > > My testing setup consists of approx. 18k tc class rules and approx.
> > > 13k tc qdisc rules, and it was altered only in the interface name....
> >
> > Please share your tc configurations (tc -s -d qd show dev ..., tc
> > -s -d filter show dev ...).
>
> I've placed the whole reproducers into the repository. Do you need exports of the rules too?
>
> > Also, it would be great if you can provide a minimal reproducer.
>
> I'm afraid that a minimal reproducer won't trigger the problem. This
> was happening mostly on servers with large tc rule setups. I have
> tried to create a small reproducer for the nftables developer many
> times without success. I can try to create a reproducer as small as
> possible, but it may still consist of thousands of rules.

Yeah, this problem is probably TC specific, as we clean up from
the top qdisc down to each class and each filter.
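
(Roughly, the teardown recursion has this shape; a simplified sketch
of the call chain with details omitted, not the exact code:)

	/* destroying the root hfsc qdisc visits every class, and each
	 * class resets and destroys its leaf qdisc in turn
	 */
	qdisc_destroy(root)
	    -> hfsc_destroy_qdisc()
	        -> for each class: hfsc_destroy_class()
	            -> qdisc_destroy(cl->qdisc)  /* leaf, e.g. fq_codel */
	                -> fq_codel_reset() + fq_codel_destroy()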

Can you try to reduce the number of TC classes, for example,
down to half, to see if the problem is gone? This could confirm
whether it is related to the number of TC classes/filters.

Thanks!


* Re: iproute2: tc deletion freezes whole server
  2020-03-24 22:57     ` Cong Wang
@ 2020-03-25 11:27       ` Václav Zindulka
  2020-03-25 13:33         ` Václav Zindulka
  2020-03-25 17:43         ` Cong Wang
  0 siblings, 2 replies; 32+ messages in thread
From: Václav Zindulka @ 2020-03-25 11:27 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Tue, Mar 24, 2020 at 11:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Hm, my bad, please also run `perf report -g` after you record them,
> we need the text output with stack traces.

No problem. I've created reports on two servers with different cards.
See here https://github.com/zvalcav/tc-kernel/tree/master/20200325

> Also, do you have a complete kernel log too? If your network is
> completely down, you need a serial console to capture it; kdump could
> also help if you know how to set it up. The kernel log usually has
> some indication of hangs: for example, if we have too much to do with
> the RTNL lock held, the kernel will complain about other tasks hung
> waiting for RTNL.

I've exported dmesg for now. There are problems similar to this one on
both servers:

perf: interrupt took too long (3173 > 3128), lowering
kernel.perf_event_max_sample_rate to 63000.

I don't think the kernel crashes, because every time it happened the
network adapter reset, but nothing else appeared in /var/log. There
are "Hold timer expired" errors from bird due to our custom BGP setup
with a 20s hold time. Sometimes there is:

bird6: I/O loop cycle took 5041 ms for 2 events

I'll try to set up kdump. On the testing servers the problems are not
that significant, and I'm able to access them if I delete from a
different interface than the one I use to access them. Under heavy
workload it causes system freezes. Yet I can't reproduce it on
production servers, because that would limit our customers. I'll try
to generate some traffic to put some pressure on it if necessary.

> > > > When I call this command on an ifb interface or an RJ45 interface,
> > > > everything is done within one second.
> > >
> > >
> > > Do they have the same tc configuration and same workload?
> >
> > Yes, both reproducers are exactly the same and the interfaces are
> > configured in a similar way. I have most of the offloading turned off
> > for the physical interfaces. Yet metallic interfaces don't cause that
> > big a delay while SFP+ dual port cards do, though not all of them. The
> > only difference between the reproducers is the interface name; see the
> > git repository above, tc-interface_name-upload/download.txt files. I
> > have altered my whole setup in the daemon I'm working on to change the
> > interfaces used. The only difference from the production setup is the
> > existence there of ingress tc filter rules that redirect traffic to
> > ifb interfaces. I don't use tc filter classification in the current
> > setup; I use nftables' ability to classify traffic. There is no
> > traffic on the interfaces except the ssh session. It behaves the same
> > way with and without traffic.
>
> This rules out the slow path vs. fast path scenario. So the problem here
> is probably that there are just too many TC classes and filters to destroy.

I think this is not the main cause. It happens with a lot of rules,
that is true, but it happens even with deletion of a smaller number of
classes and qdiscs. I delete the root qdisc only at the exit of my
program; all other changes are just deletions, creations and changes
of hfsc definitions. When I put this setup into production there were
random freezes and BGP timeouts following changes in the network that
caused deletion of tc rules. Not to mention that deletion from ifb or
metallic interfaces works without a flaw.

> Does this also mean it is 100% reproducible when you have the same
> number of classes and filters?

Yes. If it happens on a particular server, it happens every time. The
server is inaccessible on the interface where I invoke the deletion;
with heavier traffic it is not accessible at all. However, there are
servers with identical HW configuration that have absolutely no
problems. We have about 30 servers and a few of them work without
problems, yet most of them freeze during deletion.

> > > > My testing setup consists of approx. 18k tc class rules and approx.
> > > > 13k tc qdisc rules, and it was altered only in the interface name....
> > >
> > > Please share your tc configurations (tc -s -d qd show dev ..., tc
> > > -s -d filter show dev ...).
> >
> > I've placed the whole reproducers into the repository. Do you need exports of the rules too?
> >
> > > Also, it would be great if you can provide a minimal reproducer.
> >
> > I'm afraid that a minimal reproducer won't trigger the problem. This
> > was happening mostly on servers with large tc rule setups. I have
> > tried to create a small reproducer for the nftables developer many
> > times without success. I can try to create a reproducer as small as
> > possible, but it may still consist of thousands of rules.
>
> Yeah, this problem is probably TC specific, as we clean up from
> the top qdisc down to each class and each filter.

As I mentioned earlier, this happens even with deletion of a smaller
number of specific rules. I don't dispute that it may be caused by tc,
but the inability to process any packets is strange and I'm not sure
it is purely a tc problem.

> Can you try to reduce the number of TC classes, for example,
> down to half, to see if the problem is gone? This could confirm
> whether it is related to the number of TC classes/filters.

Sure. I'll try to reduce the size of the tc rule set and test the problem further.

Thank you.


* Re: iproute2: tc deletion freezes whole server
  2020-03-25 11:27       ` Václav Zindulka
@ 2020-03-25 13:33         ` Václav Zindulka
  2020-03-25 17:43         ` Cong Wang
  1 sibling, 0 replies; 32+ messages in thread
From: Václav Zindulka @ 2020-03-25 13:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Wed, Mar 25, 2020 at 12:27 PM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
> > > > > My testing setup consists of approx. 18k tc class rules and approx.
> > > > > 13k tc qdisc rules, and it was altered only in the interface name....
> > > >
> > > > Please share your tc configurations (tc -s -d qd show dev ..., tc
> > > > -s -d filter show dev ...).
> > >
> > > I've placed the whole reproducers into the repository. Do you need exports of the rules too?
> > >
> > > > Also, it would be great if you can provide a minimal reproducer.
> > >
> > > I'm afraid that a minimal reproducer won't trigger the problem. This
> > > was happening mostly on servers with large tc rule setups. I have
> > > tried to create a small reproducer for the nftables developer many
> > > times without success. I can try to create a reproducer as small as
> > > possible, but it may still consist of thousands of rules.
> >
> > Yeah, this problem is probably TC specific, as we clean up from
> > the top qdisc down to each class and each filter.
>
> As I mentioned earlier, this happens even with deletion of a smaller
> number of specific rules. I don't dispute that it may be caused by tc,
> but the inability to process any packets is strange and I'm not sure
> it is purely a tc problem.
>
> > Can you try to reduce the number of TC classes, for example,
> > down to half, to see if the problem is gone? This could confirm
> > whether it is related to the number of TC classes/filters.
>
> Sure. I'll try to reduce the size of the tc rule set and test the problem further.

I've tested it. The delay shortens and extends according to the number
of rules deleted, scaling roughly linearly at about 0.7-0.8 ms per rule:

with approx. 3500 rules it took 2364 ms
with approx. 7000 rules it took 5526 ms
with approx. 14000 rules it took 11641 ms
with approx. 18000 rules it took 13302 ms
with approx. 31000 rules it took 22880 ms

Do you want me to test it with an ifb interface too?


* Re: iproute2: tc deletion freezes whole server
  2020-03-25 11:27       ` Václav Zindulka
  2020-03-25 13:33         ` Václav Zindulka
@ 2020-03-25 17:43         ` Cong Wang
  2020-03-25 18:23           ` Václav Zindulka
  1 sibling, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-25 17:43 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

[-- Attachment #1: Type: text/plain, Size: 1097 bytes --]

On Wed, Mar 25, 2020 at 4:28 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Tue, Mar 24, 2020 at 11:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > Hm, my bad, please also run `perf report -g` after you record them,
> > we need the text output with stack traces.
>
> No problem. I've created reports on two servers with different cards.
> See here https://github.com/zvalcav/tc-kernel/tree/master/20200325

That is great!

Your kernel log does not show anything useful, so this did not lead to
any kernel hang or crash at all. (This also means you do not need
to try kdump.)

Are you able to test an experimental patch attached in this email?

It looks like your kernel spent too much time in fq_codel_reset(),
most of which is unnecessary, as the qdisc is going to be destroyed
right after being reset.
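
(For reference, fq_codel_reset() walks all of the qdisc's flows, 1024
by default, and re-initializes each one even when the qdisc is about
to be freed. An abridged sketch of net/sched/sch_fq_codel.c, from
memory, so details may differ slightly:)

	static void fq_codel_reset(struct Qdisc *sch)
	{
		struct fq_codel_sched_data *q = qdisc_priv(sch);
		int i;

		INIT_LIST_HEAD(&q->new_flows);
		INIT_LIST_HEAD(&q->old_flows);
		for (i = 0; i < q->flows_cnt; i++) {	/* 1024 iterations by default */
			struct fq_codel_flow *flow = q->flows + i;

			fq_codel_flow_purge(flow);	/* drop queued skbs */
			INIT_LIST_HEAD(&flow->flowchain);
			codel_vars_init(&flow->cvars);	/* memset, wasted before destroy */
		}
		memset(q->backlogs, 0, q->flows_cnt * sizeof(u32));
		sch->q.qlen = 0;
	}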

Note: please do not judge the patch, it is merely for testing purposes.
It is obviously ugly and only a proof of concept. A complete fix
should pass a boolean parameter down to each ->reset(), but it would
be much larger.
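
(Hypothetically, the complete version would have this shape; the
parameter name is illustrative, not from an actual patch:)

	/* sketch only: each ->reset() would learn whether a destroy
	 * follows, so implementations can skip re-initialization work
	 */
	void (*reset)(struct Qdisc *sch, bool destroying);

	static void fq_codel_reset(struct Qdisc *sch, bool destroying)
	{
		struct fq_codel_sched_data *q = qdisc_priv(sch);
		int i;

		for (i = 0; i < q->flows_cnt; i++) {
			struct fq_codel_flow *flow = q->flows + i;

			fq_codel_flow_purge(flow);
			if (destroying)
				continue;	/* about to be freed anyway */
			INIT_LIST_HEAD(&flow->flowchain);
			codel_vars_init(&flow->cvars);
		}
	}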

Thanks for testing!

[-- Attachment #2: fq_codel_reset.diff --]
[-- Type: application/octet-stream, Size: 585 bytes --]

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 968519ff36e9..df781fd4ff56 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -337,6 +337,15 @@ static void fq_codel_reset(struct Qdisc *sch)
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	int i;
 
+	if (refcount_read(&sch->refcnt) == 0) {
+		for (i = 0; i < q->flows_cnt; i++) {
+			struct fq_codel_flow *flow = q->flows + i;
+
+			fq_codel_flow_purge(flow);
+		}
+		return;
+	}
+
 	INIT_LIST_HEAD(&q->new_flows);
 	INIT_LIST_HEAD(&q->old_flows);
 	for (i = 0; i < q->flows_cnt; i++) {


* Re: iproute2: tc deletion freezes whole server
  2020-03-25 17:43         ` Cong Wang
@ 2020-03-25 18:23           ` Václav Zindulka
  2020-03-26 14:24             ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-25 18:23 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Wed, Mar 25, 2020 at 6:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 4:28 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > On Tue, Mar 24, 2020 at 11:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > Hm, my bad, please also run `perf report -g` after you record them,
> > > we need the text output with stack traces.
> >
> > No problem. I've created reports on two servers with different cards.
> > See here https://github.com/zvalcav/tc-kernel/tree/master/20200325
>
> That is great!
>
> Your kernel log does not show anything useful, so this did not lead to
> any kernel hang or crash at all. (This also means you do not need
> to try kdump.)

Ok.

>
> Are you able to test an experimental patch attached in this email?

Sure. I'll compile a new kernel tomorrow. Thank you for the quick patch.
I'll let you know as soon as I have anything.

> It looks like your kernel spent too much time in fq_codel_reset(),
> most of which is unnecessary, as the qdisc is going to be destroyed
> right after being reset.

Yeah, I noticed it too.

> Note: please do not judge the patch, it is merely for testing purposes.
> It is obviously ugly and only a proof of concept. A complete fix
> should pass a boolean parameter down to each ->reset(), but it would
> be much larger.

No judging at all. I totally understand this as prototype ;-)

>
> Thanks for testing!

I thank you! It may finally allow me to deploy a year's worth of my work.


* Re: iproute2: tc deletion freezes whole server
  2020-03-25 18:23           ` Václav Zindulka
@ 2020-03-26 14:24             ` Václav Zindulka
  2020-03-26 17:07               ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-26 14:24 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

> On Wed, Mar 25, 2020 at 6:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > Are you able to test an experimental patch attached in this email?
>
> Sure. I'll compile a new kernel tomorrow. Thank you for the quick patch.
> I'll let you know as soon as I have anything.

I've compiled a kernel with the patch you provided and tested it.
However, the problem is not solved; it behaves exactly the same way.
I'm trying to put some printk calls into fq_codel_reset() to test it
more.


* Re: iproute2: tc deletion freezes whole server
  2020-03-26 14:24             ` Václav Zindulka
@ 2020-03-26 17:07               ` Cong Wang
  2020-03-26 17:38                 ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-26 17:07 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Thu, Mar 26, 2020 at 7:24 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> > On Wed, Mar 25, 2020 at 6:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > Are you able to test an experimental patch attached in this email?
> >
> > Sure. I'll compile a new kernel tomorrow. Thank you for the quick patch.
> > I'll let you know as soon as I have anything.
>
> I've compiled a kernel with the patch you provided and tested it.
> However, the problem is not solved; it behaves exactly the same way.
> I'm trying to put some printk calls into fq_codel_reset() to test it
> more.

Are the stack traces captured by perf any different from the unpatched kernel?

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-03-26 17:07               ` Cong Wang
@ 2020-03-26 17:38                 ` Cong Wang
  2020-03-27 10:35                   ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-26 17:38 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Thu, Mar 26, 2020 at 10:07 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Mar 26, 2020 at 7:24 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > > On Wed, Mar 25, 2020 at 6:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > Are you able to test an experimental patch attached in this email?
> > >
> > > Sure. I'll compile a new kernel tomorrow. Thank you for the quick patch.
> > > I'll let you know as soon as I have anything.
> >
> > I've compiled a kernel with the patch you provided and tested it.
> > However, the problem is not solved; it behaves exactly the same way.
> > I'm trying to put some printk calls into fq_codel_reset() to test it
> > more.
>
> Are the stack traces captured by perf any different from the unpatched kernel?

Wait, it seems my assumption of refcnt==0 is wrong even for deletion,
in qdisc_graft() the last refcnt is put after dev_deactivate()...

Let me think about how we could distinguish this case from other
reset cases.
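
(For reference, the ordering in question, as a simplified sketch of
qdisc_graft() for a root qdisc on a device that is up; details and
error handling omitted:)

	if (dev->flags & IFF_UP)
		dev_deactivate(dev);	/* qdisc_reset() runs here, refcnt still held */

	/* old root qdisc is swapped out of each tx queue */

	if (dev->flags & IFF_UP)
		dev_activate(dev);

	notify_and_destroy(net, skb, n, classid, old, new);
				/* last qdisc_put(old): refcnt drops to 0 only here */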

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-03-26 17:38                 ` Cong Wang
@ 2020-03-27 10:35                   ` Václav Zindulka
  2020-03-28 13:04                     ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-27 10:35 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Thu, Mar 26, 2020 at 6:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Are the stack traces captured by perf any different from the unpatched kernel?

I've compared the initial perf reports. There is a difference in the
call stack between SFP+ interfaces and ifb/metallic interfaces: while
ifb and metallic interfaces behave normally, on SFP+ interfaces the
call stack is different. I've added a perf record of the eno1 metallic
interface here
https://github.com/zvalcav/tc-kernel/tree/master/20200325/Intel-82599ES
so you can compare the behavior of all of them. I've also added a perf
report of the patched kernel.

--100.00%--entry_SYSCALL_64_after_hwframe
          do_syscall_64
          __sys_sendmsg
          ___sys_sendmsg
          ____sys_sendmsg
          sock_sendmsg
          netlink_sendmsg
          netlink_unicast
          netlink_rcv_skb                 <-- call stack is similar to other interfaces up to here
          |
           --99.98%--rtnetlink_rcv_msg
                     tc_get_qdisc
                     qdisc_graft          <-- call stack splits in this function; dev_deactivate is called only on SFP+ interfaces
                     |
                     |--98.36%--dev_deactivate        <-- 98% of the fq_codel_reset() calls are made here; these must be the excessive calls, see below
                     |          dev_deactivate_many
                     |          |
                     |          |--49.28%--dev_deactivate_queue.constprop.57
                     |          |          |
                     |          |           --49.27%--qdisc_reset
                     |          |                     hfsc_reset_qdisc
                     |          |                     |
                     |          |                      --48.69%--qdisc_reset
                     |          |                                |
                     |          |                                 --47.23%--fq_codel_reset   <-- half of the excessive calls come from here
                     |          |                                           |
                     |          |                                           |--3.20%--codel_vars_init
                     |          |                                           |
                     |          |                                           |--1.64%--rtnl_kfree_skbs
                     |          |                                           |
                     |          |                                            --0.81%--memset_erms
                     |          |
                     |           --49.08%--qdisc_reset
                     |                     hfsc_reset_qdisc
                     |                     |
                     |                      --48.53%--qdisc_reset
                     |                                |
                     |                                 --47.09%--fq_codel_reset   <-- the other half of the excessive calls come from here
                     |                                           |
                     |                                           |--2.90%--codel_vars_init
                     |                                           |
                     |                                           |--1.61%--rtnl_kfree_skbs
                     |                                           |
                     |                                            --0.82%--memset_erms
                     |
                      --1.62%--qdisc_destroy          <-- the remaining, (I suppose) regular calls; the call stack is similar to other interfaces here
                                |
                                |--0.86%--hfsc_destroy_qdisc
                                |          |
                                |           --0.82%--hfsc_destroy_class
                                |                     |
                                |                      --0.81%--qdisc_destroy
                                |                                |
                                |                                 --0.72%--fq_codel_reset
                                |
                                 --0.77%--hfsc_reset_qdisc
                                           |
                                            --0.75%--qdisc_reset
                                                      |
                                                       --0.70%--fq_codel_reset

> On Thu, Mar 26, 2020 at 10:07 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Mar 26, 2020 at 7:24 AM Václav Zindulka
> > <vaclav.zindulka@tlapnet.cz> wrote:
> > >
> > > > On Wed, Mar 25, 2020 at 6:43 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > > Are you able to test an experimental patch attached in this email?
> > > >
> > > > Sure. I'll compile a new kernel tomorrow. Thank you for the quick patch.
> > > > I'll let you know as soon as I have anything.
> > >
> > > I've compiled a kernel with the patch you provided and tested it.
> > > However, the problem is not solved; it behaves exactly the same way.
> > > I'm trying to put some printk calls into fq_codel_reset() to test it
> > > more.
> >
> > Are the stack traces captured by perf any different from the unpatched kernel?
>
> Wait, it seems my assumption of refcnt==0 is wrong even for deletion,
> in qdisc_graft() the last refcnt is put after dev_deactivate()...

Your assumption is not totally wrong. I have added some printks into
the fq_codel_reset() function. The final passes during deletion are
processed in the if condition you added in the patch: 13706. Yet the
rest, the vast majority, go through the regular routine: 1768074.
(1024 is the value of i in the for loop.)

But the most significant problem is that fq_codel_reset() is called
far too many times. According to kern.log it was being called about
77,000 times per second for almost 24 seconds: over 1.7 million calls
during the rule deletion.
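
The printks were along these lines (a rough sketch of the
instrumentation; the actual debug patch may have differed slightly):

	/* illustrative debug printks in fq_codel_reset() */
	if (refcount_read(&sch->refcnt) == 0) {
		for (i = 0; i < q->flows_cnt; i++)
			fq_codel_flow_purge(q->flows + i);
		printk(KERN_INFO "Function fq_codel_reset - patch exit - passes: %d\n", i);
		return;
	}
	/* ... regular reset loop over all q->flows_cnt flows ... */
	printk(KERN_INFO "Function fq_codel_reset - regular exit - passes: %d\n", i);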

Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791176] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791184] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791192] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791199] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791207] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791214] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791221] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791228] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791236] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791243] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791251] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791258] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791266] Function fq_codel_reset - regular exit - passes: 1024
Mar 26 15:26:23 shaperd-prelouc1-dev kernel: [  110.791273] Function fq_codel_reset - regular exit - passes: 1024

There follow another 1.7 million lines... The function doesn't take
much time according to the timestamps, but it is called far too many
times.

Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017811] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017823] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017836] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017849] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017862] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017874] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 15:26:46 shaperd-prelouc1-dev kernel: [  134.017886] Function fq_codel_reset - patch exit - passes: 1024
Mar 26 21:05:54 shaperd-prelouc1-dev kernel: [  134.017899] Function fq_codel_reset - patch exit - passes: 1024

>
> Let me think about how we could distinguish this case from other
> reset cases.

As above, the problem is most probably the excessive calls to
fq_codel_reset(). I'm not surprised it freezes the whole server or
network adapter.

Thank you


* Re: iproute2: tc deletion freezes whole server
  2020-03-27 10:35                   ` Václav Zindulka
@ 2020-03-28 13:04                     ` Václav Zindulka
  2020-03-31  5:59                       ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-28 13:04 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Fri, Mar 27, 2020 at 11:35 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> Your assumption is not totally wrong. I have added some printks into
> the fq_codel_reset() function. The final passes during deletion are
> processed in the if condition you added in the patch: 13706. Yet the
> rest, the vast majority, go through the regular routine: 1768074.
> (1024 is the value of i in the for loop.)

Ok, so I went through the kernel source a little bit. I've found out
that dev_deactivate() is called only for interfaces that are up. My
bad: I forgot that after my daemon is deactivated, the ifb interfaces
are set down. Nevertheless, after setting ifb0 up and doing a perf
record on it, the numbers are much lower anyway: 13706 exits through
the condition added in your patch and 41118 regular exits. I've
uploaded the perf report here
https://github.com/zvalcav/tc-kernel/tree/master/20200328

I've also tried this on a metallic interface on a different server
which has a link up. There were 39651 patch exits and 286412 regular
exits. That is more than the ifb interface, yet way less than the
SFP+ interface, and it behaves correctly.

BTW sorry for the misformatted previous mail. I was trying to put my
comments inside the perf report for better visibility.


* Re: iproute2: tc deletion freezes whole server
  2020-03-28 13:04                     ` Václav Zindulka
@ 2020-03-31  5:59                       ` Cong Wang
  2020-03-31 12:46                         ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-03-31  5:59 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Sat, Mar 28, 2020 at 6:04 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Fri, Mar 27, 2020 at 11:35 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > Your assumption is not totally wrong. I have added some printks into
> > fq_codel_reset() function. Final passes during deletion are processed
> > in the if condition you added in the patch - 13706. Yet the rest and
> > most of them go through regular routine - 1768074. 1024 is value of i
> > in for loop.
>
> Ok, so I went through the kernel source a little bit. I've found out
> that dev_deactivate is called only for interfaces that are up. My bad
> I forgot that after deactivation of my daemon ifb interfaces are set
> to down. Nevertheless after setting it up and doing perf record on
> ifb0 numbers are much lower anyway. 13706 exits through your condition
> added in patch. 41118 regular exits. I've uploaded perf report here
> https://github.com/zvalcav/tc-kernel/tree/master/20200328
>
> I've also tried this on metallic interface on different server which
> has a link on it. There were 39651 patch exits. And 286412 regular
> exits. It is more than ifb interface, yet it is way less than sfp+
> interface and behaves correctly.

Interesting. At the point dev_deactivate() is called, the refcnt
should not be zero, it should be at least 1, so my patch should not
affect dev_deactivate(); it does affect the last qdisc_put() after
it.

Of course, my intention is indeed to eliminate all of the
unnecessary memset() calls in ->reset() before ->destroy(). I will
provide you with a complete patch tomorrow if you can test it; it
should improve hfsc_reset() too.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-03-31  5:59                       ` Cong Wang
@ 2020-03-31 12:46                         ` Václav Zindulka
  2020-04-08 20:18                           ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-03-31 12:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Tue, Mar 31, 2020 at 8:00 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Sat, Mar 28, 2020 at 6:04 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > On Fri, Mar 27, 2020 at 11:35 AM Václav Zindulka
> > <vaclav.zindulka@tlapnet.cz> wrote:
> > >
> > > Your assumption is not totally wrong. I have added some printks into
> > > fq_codel_reset() function. Final passes during deletion are processed
> > > in the if condition you added in the patch - 13706. Yet the rest and
> > > most of them go through regular routine - 1768074. 1024 is value of i
> > > in for loop.
> >
> > Ok, so I went through the kernel source a little bit. I've found out
> > that dev_deactivate is called only for interfaces that are up. My bad
> > I forgot that after deactivation of my daemon ifb interfaces are set
> > to down. Nevertheless after setting it up and doing perf record on
> > ifb0 numbers are much lower anyway. 13706 exits through your condition
> > added in patch. 41118 regular exits. I've uploaded perf report here
> > https://github.com/zvalcav/tc-kernel/tree/master/20200328
> >
> > I've also tried this on metallic interface on different server which
> > has a link on it. There were 39651 patch exits. And 286412 regular
> > exits. It is more than ifb interface, yet it is way less than sfp+
> > interface and behaves correctly.
>
> Interesting, at the point of dev_deactivate() is called, the refcnt
> should not be zero, it should be at least 1, so my patch should
> not affect dev_deactivate(), it does affect the last qdisc_put()
> after it.
>
> Of course, my intention is indeed to eliminate all of the
> unnecessary memset() in the ->reset() before ->destroy().
> I will provide you a complete patch tomorrow if you can test
> it, which should improve hfsc_reset() too.
>
> Thanks.

Sure, I'll test it and I'm looking forward to it. :-)

Thank you for all your help and effort. I appreciate it a lot.


* Re: iproute2: tc deletion freezes whole server
  2020-03-31 12:46                         ` Václav Zindulka
@ 2020-04-08 20:18                           ` Cong Wang
  2020-04-12 20:17                             ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-04-08 20:18 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

Hi, Václav

Sorry for the delay.

The problem is actually more complicated than I thought and needs
more work. Below is the first pile of patches I have for you to
test:

https://github.com/congwang/linux/commits/qdisc_reset

It is based on the latest net-next branch. Please let me know the result.

Meanwhile, I will continue to work on this and will provide you more
patches on top.

Thanks for testing!


* Re: iproute2: tc deletion freezes whole server
  2020-04-08 20:18                           ` Cong Wang
@ 2020-04-12 20:17                             ` Václav Zindulka
  2020-04-13 17:28                               ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-04-12 20:17 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Wed, Apr 8, 2020 at 10:18 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi, Václav

Hello Cong,

> Sorry for the delay.

No problem, I'm actually glad you are diagnosing the problem
further. I didn't have much time until late yesterday and today to
apply the patches and test them.

> The problem is actually more complicated than I thought, although it
> needs more work, below is the first pile of patches I have for you to
> test:
>
> https://github.com/congwang/linux/commits/qdisc_reset
>
> It is based on the latest net-next branch. Please let me know the result.

I have applied all the patches from your four commits to my custom
5.4.6 kernel source. There was no change in the number of
fq_codel_reset() calls. Tested on ifb, RJ45 and SFP+ interfaces.

> Meanwhile, I will continue to work on this and will provide you more
> patches on top.

Thank you very much for your effort!


* Re: iproute2: tc deletion freezes whole server
  2020-04-12 20:17                             ` Václav Zindulka
@ 2020-04-13 17:28                               ` Cong Wang
  2020-04-15 15:01                                 ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-04-13 17:28 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Sun, Apr 12, 2020 at 1:18 PM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
> [...]
>
> I have applied all the patches in your four commits to my custom 5.4.6
> kernel source. There was no change in the amount of fq_codel_reset
> calls. Tested on ifb, RJ45 and SFP+ interfaces.

It is true that my patches do not reduce the number of
fq_codel_reset() calls; they are intended to reduce the CPU time
spent in each fq_codel_reset().

Can you measure this? Note that you no longer have to add your own
printk(), because my patches add a few tracepoints, especially for
qdisc_reset(), so you can obtain the time by checking the timestamps
of these trace events. Of course, you can also use perf trace like
you did before.
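
For example, something along these lines should work (a sketch; the
exact tracepoint name comes from the patches, so check the output of
perf list 'qdisc:*' first):

====
perf record -e qdisc:qdisc_reset -a -- tc qdisc del dev enp1s0f0 root
perf script    # timestamps of the events show the time spent per reset
====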

Thanks!


* Re: iproute2: tc deletion freezes whole server
  2020-04-13 17:28                               ` Cong Wang
@ 2020-04-15 15:01                                 ` Václav Zindulka
  2020-04-30 12:40                                   ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-04-15 15:01 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Mon, Apr 13, 2020 at 7:29 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Sun, Apr 12, 2020 at 1:18 PM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> > [...]
>
> It is true my patches do not reduce the number of fq_codel_reset() calls,
> they are intended to reduce the CPU time spent in each fq_codel_reset().
>
> Can you measure this? Note, you do not have to add your own printk()
> any more, because my patches add a few tracepoints, especially for
> qdisc_reset(). So you can obtain the time by checking the timestamps
> of these trace events. Of course, you can also use perf trace like you
> did before.

Sorry for the delayed response. We were moving to a new house, so I
didn't have much time to test. I've measured your pile of patches
applied vs. an unpatched kernel. The result is a little better, but
only about 1 s faster. The results are here:
https://github.com/zvalcav/tc-kernel/tree/master/20200415
Do you need any additional reports or measurements of other
interfaces? I've also recompiled the kernel without the printk()
calls, which had some overhead too.


* Re: iproute2: tc deletion freezes whole server
  2020-04-15 15:01                                 ` Václav Zindulka
@ 2020-04-30 12:40                                   ` Václav Zindulka
  2020-05-04 17:46                                     ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-04-30 12:40 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Wed, Apr 15, 2020 at 5:01 PM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
> > > [...]
> >
> > It is true my patches do not reduce the number of fq_codel_reset() calls,
> > they are intended to reduce the CPU time spent in each fq_codel_reset().
> >
> > Can you measure this? Note, you do not have to add your own printk()
> > any more, because my patches add a few tracepoints, especially for
> > qdisc_reset(). So you can obtain the time by checking the timestamps
> > of these trace events. Of course, you can also use perf trace like you
> > did before.
>
> Sorry for delayed responses. We were moving to a new house so I didn't
> have much time to test it. I've measured your pile of patches applied
> vs unpatched kernel. Result is a little bit better, but it is only
> about 1s faster. Results are here. Do you need any additional reports
> or measurements of other interfaces?
> https://github.com/zvalcav/tc-kernel/tree/master/20200415 I've
> recompiled the kernel without printk which had some overhead too.

Hello Cong,

have you had any time to look into it further? I'm only asking
because my boss wants a verdict from me. I've started studying eBPF
and XDP in the meantime, so we have an alternative in case there is
no solution to this problem.


* Re: iproute2: tc deletion freezes whole server
  2020-04-30 12:40                                   ` Václav Zindulka
@ 2020-05-04 17:46                                     ` Cong Wang
  2020-05-04 20:36                                       ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-04 17:46 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Thu, Apr 30, 2020 at 5:40 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Wed, Apr 15, 2020 at 5:01 PM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> > > [...]
> >
> > Sorry for delayed responses. We were moving to a new house so I didn't
> > have much time to test it. I've measured your pile of patches applied
> > vs unpatched kernel. Result is a little bit better, but it is only
> > about 1s faster. Results are here. Do you need any additional reports
> > or measurements of other interfaces?
> > https://github.com/zvalcav/tc-kernel/tree/master/20200415 I've
> > recompiled the kernel without printk which had some overhead too.
>
> Hello Cong,
>
> did you have any time to look at it further? I'm just asking since my
> boss wants me to give him some verdict. I've started to study eBPF and
> XDP in the meantime so we have an alternative in case there won't be a
> solution to this problem.

Sorry for the delay. I lost access to my dev machine, so I am trying
to set this up on my own laptop.

Regarding your test result above, I did see some difference on my
side, and I have no idea why you didn't. Let me collect the data once
I set up the test environment later today.

Thanks!


* Re: iproute2: tc deletion freezes whole server
  2020-05-04 17:46                                     ` Cong Wang
@ 2020-05-04 20:36                                       ` Cong Wang
  2020-05-05  8:46                                         ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-04 20:36 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Mon, May 4, 2020 at 10:46 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Regarding to your test result above, I think I saw some difference
> on my side, I have no idea why you didn't see any difference. Please
> let me collect the data once I setup the test environment shortly today.

I tried to emulate your test case in my VM, here is the script I use:

====
ip li set dev dummy0 up
tc qd add dev dummy0 root handle 1: htb default 1
for i in `seq 1 1000`
do
  tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
  tc qd add dev dummy0 parent 1:$i fq_codel
done

time tc qd del dev dummy0 root
====

And this is the result:

    Before my patch:
     real   0m0.488s
     user   0m0.000s
     sys    0m0.325s

    After my patch:
     real   0m0.180s
     user   0m0.000s
     sys    0m0.132s

This is an obvious improvement, so I have no idea why you didn't
catch any difference.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-05-04 20:36                                       ` Cong Wang
@ 2020-05-05  8:46                                         ` Václav Zindulka
  2020-05-07 18:52                                           ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-05-05  8:46 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Mon, May 4, 2020 at 7:46 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Sorry for the delay. I lost connection to my dev machine, I am trying
> to setup this on my own laptop.

Sorry to hear that. I will gladly give you access to my testing
machine where all this nasty stuff happens every time, so you can
test it in place. You can try everything there and see the results
live. I can even give you access to the IPMI console so you can
easily switch kernels during boot. I didn't notice this problem until
deployment; my earlier testing machines had only copper Ethernet
ports, so I didn't see these problems before.

> On Mon, May 4, 2020 at 10:36 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Regarding to your test result above, I think I saw some difference
> > on my side, I have no idea why you didn't see any difference. Please
> > let me collect the data once I setup the test environment shortly today.

I saw some improvement too, more than 1.5 s faster, yet it was still
over 21 s. I measured it with perf trace, not with time. I'll try it
the same way you did.

>
> I tried to emulate your test case in my VM, here is the script I use:
>
> ====
> ip li set dev dummy0 up
> tc qd add dev dummy0 root handle 1: htb default 1
> for i in `seq 1 1000`
> do
>   tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
>   tc qd add dev dummy0 parent 1:$i fq_codel
> done
>
> time tc qd del dev dummy0 root
> ====
>
> And this is the result:
>
>     Before my patch:
>      real   0m0.488s
>      user   0m0.000s
>      sys    0m0.325s
>
>     After my patch:
>      real   0m0.180s
>      user   0m0.000s
>      sys    0m0.132s

My results with your test script.

before patch:
/usr/bin/time -p tc qdisc del dev enp1s0f0 root
real 1.63
user 0.00
sys 1.63

after patch:
/usr/bin/time -p tc qdisc del dev enp1s0f0 root
real 1.55
user 0.00
sys 1.54

> This is an obvious improvement, so I have no idea why you didn't
> catch any difference.

We use hfsc instead of htb. I don't know whether that makes any
difference. I can provide you with my test scripts if necessary.

Thank you.


* Re: iproute2: tc deletion freezes whole server
  2020-05-05  8:46                                         ` Václav Zindulka
@ 2020-05-07 18:52                                           ` Cong Wang
  2020-05-08 13:59                                             ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-07 18:52 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Tue, May 5, 2020 at 1:46 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Mon, May 4, 2020 at 7:46 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Sorry for the delay. I lost connection to my dev machine, I am trying
> > to setup this on my own laptop.
>
> Sorry to hear that. I will gladly give you access to my testing
> machine where all this nasty stuff happens every time so you can test
> it in place. You can try everything there and have online results. I
> can give you access even to the IPMI console so you can switch the
> kernel during boot easily. I didn't notice this problem until the time
> of deployment. My prior testing machines were with metallic ethernet
> ports only so I didn't know about those problems earlier.

Thanks for the offer! No worries, I set up a testing VM on my laptop.

>
> > On Mon, May 4, 2020 at 10:36 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > Regarding to your test result above, I think I saw some difference
> > > on my side, I have no idea why you didn't see any difference. Please
> > > let me collect the data once I setup the test environment shortly today.
>
> I saw some improvement too. It was more than 1.5s faster. Yet it was
> still over 21s. I measured it with perf trace, not with time. I'll try
> it the same way as you did.
>
> > [...]
>
> My results with your test script.
>
> before patch:
> /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> real 1.63
> user 0.00
> sys 1.63
>
> after patch:
> /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> real 1.55
> user 0.00
> sys 1.54
>
> > This is an obvious improvement, so I have no idea why you didn't
> > catch any difference.
>
> We use hfsc instead of htb. I don't know whether it may cause any
> difference. I can provide you with my test scripts if necessary.

Yeah, you can try to replace the htb with hfsc in my script; I
didn't spend time figuring out the hfsc parameters.

My point here is that if I can see the difference with merely 1000 tc
classes, you should see a bigger difference with the hundreds of
thousands of classes in your setup. So I don't know why you saw a
relatively smaller difference.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-05-07 18:52                                           ` Cong Wang
@ 2020-05-08 13:59                                             ` Václav Zindulka
  2020-05-17 19:35                                               ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-05-08 13:59 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Thu, May 7, 2020 at 8:52 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 1:46 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > On Mon, May 4, 2020 at 7:46 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > Sorry for the delay. I lost connection to my dev machine, I am trying
> > > to setup this on my own laptop.
> >
> > Sorry to hear that. I will gladly give you access to my testing
> > machine where all this nasty stuff happens every time so you can test
> > it in place. You can try everything there and have online results. I
> > can give you access even to the IPMI console so you can switch the
> > kernel during boot easily. I didn't notice this problem until the time
> > of deployment. My prior testing machines were with metallic ethernet
> > ports only so I didn't know about those problems earlier.
>
> Thanks for the offer! No worries, I setup a testing VM on my laptop.

OK

> > > [...]
> >
> > My results with your test script.
> >
> > before patch:
> > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > real 1.63
> > user 0.00
> > sys 1.63
> >
> > after patch:
> > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > real 1.55
> > user 0.00
> > sys 1.54
> >
> > > This is an obvious improvement, so I have no idea why you didn't
> > > catch any difference.
> >
> > We use hfsc instead of htb. I don't know whether it may cause any
> > difference. I can provide you with my test scripts if necessary.
>
> Yeah, you can try to replace the htb with hfsc in my script,
> I didn't spend time to figure out hfsc parameters.

tc class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2 13107200 ul m1 0 d 0 m2 13107200

but it behaves the same as htb...
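
For reference, the whole adapted loop looks like this (a sketch of my
modification; the hfsc rates are arbitrary):

====
ip li set dev dummy0 up
tc qd add dev dummy0 root handle 1: hfsc default 1
for i in `seq 1 1000`
do
  tc class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2 13107200 ul m1 0 d 0 m2 13107200
  tc qd add dev dummy0 parent 1:$i fq_codel
done

time tc qd del dev dummy0 root
====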

> My point here is, if I can see the difference with merely 1000
> tc classes, you should see a bigger difference with hundreds
> of thousands classes in your setup. So, I don't know why you
> saw a relatively smaller difference.

I saw a relatively big difference: about 1.5 s faster on my huge
setup, which is a lot. Yet maybe the problem is caused by something
else? I thought about TX/RX queues. RJ45 ports have up to 4 TX and RX
queues; SFP+ interfaces have much higher limits, 8 or even 64
possible queues. I've tried increasing the number of queues from 4 to
8 and decreasing it to 2 using ethtool, but there was no difference:
about 1.62 - 1.63 s with an unpatched kernel and about 1.55 - 1.58 s
with your patches applied. I also tried it on ifb and RJ45
interfaces, where it took about 0.02 - 0.03 s unpatched and 0.05 s
patched, which is strange, but that may be because it was very fast
even before.
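
For completeness, the queue counts were changed roughly like this (a
sketch; whether the driver exposes "combined" channels or separate
rx/tx ones differs per NIC):

====
ethtool -l enp1s0f0              # show maximum vs. currently active queues
ethtool -L enp1s0f0 combined 8   # set the number of active queue pairs
====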

I have commits c71c00df335f6aff00d3dc7f28e06dc8abc088a7,
13a5aec17cc65f6aa5c3bc470f504650bd465a69,
720cc6b0d12fb7c8a494e441ebd360c62023dad2 and
51287a4bc6f2addd4a8c1919829aab3bb7c706c9 from
https://github.com/congwang/linux/commits/qdisc_reset applied on the
5.4.6 kernel. I can apply them to the newest one if that could have
any impact. I hope I've applied the right patches and haven't missed
any older commits.

I've even tried compiling the kernel from your repository, branch
qdisc_reset. The times are a little lower than with the patched
5.4.6: 1.52 - 1.53 s. Yet I still can't reach the big improvement you
saw.

Thank you.


* Re: iproute2: tc deletion freezes whole server
  2020-05-08 13:59                                             ` Václav Zindulka
@ 2020-05-17 19:35                                               ` Cong Wang
  2020-05-18 14:16                                                 ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-17 19:35 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Thu, May 7, 2020 at 8:52 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > > [...]
> >
> > Thanks for the offer! No worries, I setup a testing VM on my laptop.
>
> OK
>
> > > > [...]
> > >
> > > > This is an obvious improvement, so I have no idea why you didn't
> > > > catch any difference.
> > >
> > > We use hfsc instead of htb. I don't know whether it may cause any
> > > difference. I can provide you with my test scripts if necessary.
> >
> > Yeah, you can try to replace the htb with hfsc in my script,
> > I didn't spend time to figure out hfsc parameters.
>
> class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
> 13107200 ul m1 0 d 0 m2 13107200
>
> but it behaves the same as htb...
>
> > My point here is, if I can see the difference with merely 1000
> > tc classes, you should see a bigger difference with hundreds
> > of thousands classes in your setup. So, I don't know why you
> > saw a relatively smaller difference.
>
> I saw a relatively big difference. It was about 1.5s faster on my huge
> setup which is a lot. Yet maybe the problem is caused by something

What percentage? IIUC, without the patch it took you about 11 s, so
1.5 s faster means a 13% improvement for you?


> else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
> queues. I've tried to increase the number of queues using ethtool from
> 4 to 8 and decreased to 2. But there was no difference. It was about
> 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> patches applied. I've tried it for ifb and RJ45 interfaces where it
> took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> patches applied, which is strange, but it may be caused by the fact it
> was very fast even before.

That is odd. In fact, this is highly related to the number of TX
queues, because the existing code resets the qdiscs once for each TX
queue, so the more TX queues you have, the more resets the kernel
will do, and the more time it will take.

I plan to address this later on top of the existing patches.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-05-17 19:35                                               ` Cong Wang
@ 2020-05-18 14:16                                                 ` Václav Zindulka
  2020-05-18 18:22                                                   ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-05-18 14:16 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> > > [...]
> > >
> > > My point here is, if I can see the difference with merely 1000
> > > tc classes, you should see a bigger difference with hundreds
> > > of thousands classes in your setup. So, I don't know why you
> > > saw a relatively smaller difference.
> >
> > I saw a relatively big difference. It was about 1.5s faster on my huge
> > setup which is a lot. Yet maybe the problem is caused by something
>
> What percentage? IIUC, without patch it took you about 11s, so
> 1.5s faster means 13% improvement for you?

My whole setup takes 22.17 seconds to delete with an unpatched
kernel; with your patches applied it is 21.08 s. So the gain varies
between 1 and 1.5 s, an improvement of about 5 - 6%.

> > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
> > queues. I've tried to increase the number of queues using ethtool from
> > 4 to 8 and decreased to 2. But there was no difference. It was about
> > 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> > patches applied. I've tried it for ifb and RJ45 interfaces where it
> > took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> > patches applied, which is strange, but it may be caused by the fact it
> > was very fast even before.
>
> That is odd. In fact, this is highly related to number of TX queues,
> because the existing code resets the qdisc's once for each TX
> queue, so the more TX queues you have, the more resets kernel
> will do, that is the more time it will take.

Couldn't the problem be that the reset is done on both active and
inactive queues every time? That would explain why decreasing or
increasing the number of active queues had no effect. Yet it doesn't
explain why the Intel card (82599ES) with 64 possible queues has
exactly the same problem as the Mellanox (ConnectX-4 LX) with 8
possible queues.

I've been playing with this problem today. Every deleted fq_codel
qdisc, root and non-root, makes the kernel empty all possible queues
of the network adapter, not just the active ones, even queues the
adapter can't use at all. For every SFP+ card I have tested, it calls
fq_codel_reset() (and would call any other reset function) 64 times
for the egress qdisc and 64 times for the ingress qdisc, even when no
ingress is defined. The solution to this whole problem would be to
reset only the activated queues. On the RJ45 interface I tested there
are 8 possible queues, so the reset is done 8 times for every deleted
qdisc, 16 in total, since ingress and egress are processed every
time.

So, a little calculation. My initial setup contained 13706 qdisc
rules. For ifb that means 13706 resets for egress and 13706 for
ingress, 27412 reset calls in total, because an ifb interface can
have only one transmit queue. The average time spent per
fq_codel_reset() call according to my initial perf reports is
somewhere between 7 and 16 microseconds. 27412 * 0.000008 = 0.219296
s. The RJ45 interface does 8 calls for every qdisc: 13706 * 8 * 2 =
219296 resets, and 219296 * 0.000008 = 1.754368 s. That is still OK.
The SFP+ interface does 64 resets per qdisc rule: 13706 * 64 * 2 =
1754368, very close to the huge number I noticed initially using
printk. And 1754368 * 0.000008 = 14.034944 s; with the slowest calls,
1754368 * 0.000016 = 28.069888 s. Voila. Gotcha. So my final
judgement is: don't empty something we don't use anyway. For Intel it
may be reasonable, since it can have all 64 queues defined; Mellanox
has its limit at 8, but it is still reset 64 times. We mostly
decrease the number of queues to 4, sometimes 2 depending on the CPU
used, yet every CPU had to handle 64 resets.
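
The same arithmetic as a quick shell check (using the 8 and 16
microsecond per-call figures from the perf reports above):

====
echo "13706 * 64 * 2 * 0.000008" | bc -l   # SFP+: 14.034944 s
echo "13706 * 64 * 2 * 0.000016" | bc -l   # SFP+ worst case: 28.069888 s
echo "13706 * 8 * 2 * 0.000008" | bc -l    # RJ45: 1.754368 s
====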

With the attached patch I'm down to 1.7 seconds, more than a 90%
improvement :-) Can you please check it and pass it to the proper
places? According to the debugging printk messages it now empties
only the active queues.

Thank you for all your help and effort.


>
> I plan to address this later on top of the existing patches.
>
> Thanks.

[-- Attachment #2: netdevice_num_tx_queues.patch --]
[-- Type: text/x-patch, Size: 300 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2135,7 +2135,7 @@
 {
 	unsigned int i;
 
-	for (i = 0; i < dev->num_tx_queues; i++)
+	for (i = 0; i < dev->real_num_tx_queues; i++)
 		f(dev, &dev->_tx[i], arg);
 }


* Re: iproute2: tc deletion freezes whole server
  2020-05-18 14:16                                                 ` Václav Zindulka
@ 2020-05-18 18:22                                                   ` Cong Wang
  2020-05-19  8:04                                                     ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-18 18:22 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Mon, May 18, 2020 at 7:16 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
>
> On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> > <vaclav.zindulka@tlapnet.cz> wrote:
> > > [...]
> > >
> > > I saw a relatively big difference. It was about 1.5s faster on my huge
> > > setup which is a lot. Yet maybe the problem is caused by something
> >
> > What percentage? IIUC, without patch it took you about 11s, so
> > 1.5s faster means 13% improvement for you?
>
> My whole setup needs 22.17 seconds to delete with an unpatched kernel.
> With your patches applied it is 21.08. So it varies between 1 - 1.5s.
> Improvement is about 5 - 6%.

Good to know.

>
> > > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > > queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
> > > queues. I've tried to increase the number of queues using ethtool from
> > > 4 to 8 and decreased to 2. But there was no difference. It was about
> > > 1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
> > > patches applied. I've tried it for ifb and RJ45 interfaces where it
> > > took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
> > > patches applied, which is strange, but it may be caused by the fact it
> > > was very fast even before.
> >
> > That is odd. In fact, this is highly related to number of TX queues,
> > because the existing code resets the qdisc's once for each TX
> > queue, so the more TX queues you have, the more resets kernel
> > will do, that is the more time it will take.
>
> Can't the problem be caused that reset is done on active and inactive
> queues every time? It would explain why it had no effect in decreasing
> and increasing the number of active queues. Yet it doesn't explain why
> Intel card (82599ES) with 64 possible queues has exactly the same
> problem as Mellanox (ConnectX-4 LX) with 8 possible queues.

Regardless of these queues, the qdiscs should be reset only once,
because all of these queues point to the same instance of the root
qdisc in your case.

[...]
> With the attached patch I'm down to 1.7 seconds - more than 90%
> improvement :-) Can you please check it and pass it to proper places?
> According to debugging printk messages it empties only active queues.

You can't change netdev_for_each_tx_queue(); it would certainly
break netif_alloc_netdev_queues(), at the very least.

Let me think about how to fix this properly. I have some ideas and
will provide you with some patch(es) to test soon.

Thanks!


* Re: iproute2: tc deletion freezes whole server
  2020-05-18 18:22                                                   ` Cong Wang
@ 2020-05-19  8:04                                                     ` Václav Zindulka
  2020-05-19 17:56                                                       ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-05-19  8:04 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Mon, May 18, 2020 at 8:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 7:16 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> >
> > On Sun, May 17, 2020 at 9:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
> > > <vaclav.zindulka@tlapnet.cz> wrote:
> > > [...]
>
> Regardless of these queues, the qdisc's should be only reset once,
> because all of these queues point to the same instance of root
> qdisc in your case.
>
> [...]
> > With the attached patch I'm down to 1.7 seconds - more than 90%
> > improvement :-) Can you please check it and pass it to proper places?
> > According to debugging printk messages it empties only active queues.
>
> You can't change netdev_for_each_tx_queue(), it would certainly at least
> break netif_alloc_netdev_queues().

You are right, I didn't check that. But that is the only occurrence
of netdev_for_each_tx_queue() outside sch_generic.c. I've duplicated
netdev_for_each_tx_queue() into netdev_for_each_active_tx_queue(),
which loops over real_num_tx_queues instead, and altered
sch_generic.c to use the new function everywhere. From my point of
view it is easier to avoid the unnecessary calls for unallocated
queues this way, but I totally understand your point of view, which
would mean a cleaner approach. I'm more than happy with that small
fix. I'll help you test all the patches necessary, yet our production
will run on that dirty fix. You know, my first kernel patch. :-D If
you want, I'll send it to you.

>
> Let me think how to fix this properly, I have some ideas and will provide
> you some patch(es) to test soon.

Sure, I'll wait. I have plenty of time now that the main problem is fixed :-)

Thank you.


* Re: iproute2: tc deletion freezes whole server
  2020-05-19  8:04                                                     ` Václav Zindulka
@ 2020-05-19 17:56                                                       ` Cong Wang
  2020-05-24 10:03                                                         ` Václav Zindulka
  0 siblings, 1 reply; 32+ messages in thread
From: Cong Wang @ 2020-05-19 17:56 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Tue, May 19, 2020 at 1:04 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
> >
> > Let me think how to fix this properly, I have some ideas and will provide
> > you some patch(es) to test soon.
>
> Sure, I'll wait. I have plenty of time now with the main problem fixed :-)

Can you help to test the patches below?
https://github.com/congwang/linux/commits/qdisc_reset2

I removed the last patch you previously tested and added two
more patches on top. These two patches should get rid of most
of the unnecessary resets. I tested them with a veth pair with 8 TX
queues, but I don't have any real physical NIC to test with.

Thanks.


* Re: iproute2: tc deletion freezes whole server
  2020-05-19 17:56                                                       ` Cong Wang
@ 2020-05-24 10:03                                                         ` Václav Zindulka
  2020-05-26 18:37                                                           ` Cong Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Václav Zindulka @ 2020-05-24 10:03 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers

On Tue, May 19, 2020 at 7:57 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 1:04 AM Václav Zindulka
> <vaclav.zindulka@tlapnet.cz> wrote:
> > >
> > > Let me think how to fix this properly, I have some ideas and will provide
> > > you some patch(es) to test soon.
> >
> > Sure, I'll wait. I have plenty of time now with the main problem fixed :-)
>
> Can you help to test the patches below?
> https://github.com/congwang/linux/commits/qdisc_reset2
>
> I removed the last patch you previously tested and added two
> more patches on top. These two patches should get rid of most
> of the unnecessary resets. I tested them with veth pair with 8 TX
> queues, but I don't have any real physical NIC to test.
>
> Thanks.

I've tested it and I have to admit that your kung fu is even better
than mine :-D My large ruleset with over 13k qdiscs defined went from
22 s to 520 ms. I've also tested the downtime of the interface during
deletion of the root qdisc, and it matches the time I measured, so
that is great.

/usr/bin/time -p tc qdisc del dev enp1s0f0np0 root
real 0.52
user 0.00
sys 0.52
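
The downtime itself was observed roughly like this (a sketch, run
from an interactive shell, with a made-up peer address):

====
ping -D -i 0.1 192.0.2.1 > ping.log &   # -D prefixes epoch timestamps
tc qdisc del dev enp1s0f0np0 root
kill %1
# the largest gap between consecutive timestamps in ping.log
# approximates the interface downtime
====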

I've even added my patch for only active queues on top, but it made
no difference beyond your awesome patches :-)

Thank you very much. You've really helped me a lot.


* Re: iproute2: tc deletion freezes whole server
  2020-05-24 10:03                                                         ` Václav Zindulka
@ 2020-05-26 18:37                                                           ` Cong Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Cong Wang @ 2020-05-26 18:37 UTC (permalink / raw)
  To: Václav Zindulka; +Cc: Linux Kernel Network Developers

On Sun, May 24, 2020 at 3:04 AM Václav Zindulka
<vaclav.zindulka@tlapnet.cz> wrote:
> [...]
>
> I've tested it and I have to admit that your Kung-fu is even better
> than mine :-D My large ruleset with over 13k qdiscs defined got from
> 22s to 520ms. I've tested downtime of interface during deletion of
> root qdisc and it corresponds to the time I measured so it is great.
>
> /usr/bin/time -p tc qdisc del dev enp1s0f0np0 root
> real 0.52
> user 0.00
> sys 0.52
>
> I've even added my patch for only active queues but it didn't make any
> difference from your awesome patch :-)

Great! I will send out the patches formally with your Tested-by.


>
> Thank you very much. You helped me really a lot.

Thanks for testing!


end of thread

Thread overview: 32+ messages
2020-03-22 18:06 iproute2: tc deletion freezes whole server Václav Zindulka
2020-03-23 18:17 ` Cong Wang
2020-03-24 16:27   ` Václav Zindulka
2020-03-24 22:57     ` Cong Wang
2020-03-25 11:27       ` Václav Zindulka
2020-03-25 13:33         ` Václav Zindulka
2020-03-25 17:43         ` Cong Wang
2020-03-25 18:23           ` Václav Zindulka
2020-03-26 14:24             ` Václav Zindulka
2020-03-26 17:07               ` Cong Wang
2020-03-26 17:38                 ` Cong Wang
2020-03-27 10:35                   ` Václav Zindulka
2020-03-28 13:04                     ` Václav Zindulka
2020-03-31  5:59                       ` Cong Wang
2020-03-31 12:46                         ` Václav Zindulka
2020-04-08 20:18                           ` Cong Wang
2020-04-12 20:17                             ` Václav Zindulka
2020-04-13 17:28                               ` Cong Wang
2020-04-15 15:01                                 ` Václav Zindulka
2020-04-30 12:40                                   ` Václav Zindulka
2020-05-04 17:46                                     ` Cong Wang
2020-05-04 20:36                                       ` Cong Wang
2020-05-05  8:46                                         ` Václav Zindulka
2020-05-07 18:52                                           ` Cong Wang
2020-05-08 13:59                                             ` Václav Zindulka
2020-05-17 19:35                                               ` Cong Wang
2020-05-18 14:16                                                 ` Václav Zindulka
2020-05-18 18:22                                                   ` Cong Wang
2020-05-19  8:04                                                     ` Václav Zindulka
2020-05-19 17:56                                                       ` Cong Wang
2020-05-24 10:03                                                         ` Václav Zindulka
2020-05-26 18:37                                                           ` Cong Wang
