* [PATCH] add an initial version of snmp_counter.rst
@ 2018-11-09 18:13 yupeng
  2018-11-10  2:20 ` Cong Wang
  2018-11-10 16:56 ` Randy Dunlap
  0 siblings, 2 replies; 3+ messages in thread
From: yupeng @ 2018-11-09 18:13 UTC (permalink / raw)
  To: netdev, xiyou.wangcong

The snmp_counter.rst document runs a set of simple experiments and
explains the meaning of the snmp counters based on the experiments'
results. This is an initial version that only covers a small part of
the snmp counters.

Signed-off-by: yupeng <yupeng0921@gmail.com>
---
 Documentation/networking/index.rst        |   1 +
 Documentation/networking/snmp_counter.rst | 963 ++++++++++++++++++++++
 2 files changed, 964 insertions(+)
 create mode 100644 Documentation/networking/snmp_counter.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index bd89dae8d578..6a47629ef8ed 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -31,6 +31,7 @@ Contents:
    net_failover
    alias
    bridge
+   snmp_counter
 
 .. only::  subproject
 
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
new file mode 100644
index 000000000000..2939c5acf675
--- /dev/null
+++ b/Documentation/networking/snmp_counter.rst
@@ -0,0 +1,963 @@
+=====================
+snmp counter tutorial
+=====================
+
+This document explains the meaning of snmp counters. Rather than
+explaining the counters one by one, it creates a set of experiments
+and explains the counters based on the experiments' results, which
+makes their meanings easier to understand. The experiments run on one
+or two virtual machines. Except for the test commands used in the
+experiments, the virtual machines have no other network traffic. We
+use the 'nstat' command to get the values of the snmp counters.
+Before every test, we run 'nstat -n' to update the history, so the
+'nstat' output only shows the changes of the snmp counters since the
+last run. For more information about nstat, please refer to:
+
+http://man7.org/linux/man-pages/man8/nstat.8.html
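nstat reads these counters from /proc/net/snmp and /proc/net/netstat, where each protocol has a header line of counter names followed by a line of values. As a rough illustration of nstat's snapshot-and-diff workflow (a minimal sketch, not nstat's actual implementation), the two-line format can be parsed and diffed like this:

```python
# Minimal sketch of what nstat does: parse the '/proc/net/snmp'-style
# two-line format (a name line and a value line per protocol) and diff
# two samples, so only counters that changed are reported.

def parse_snmp(text):
    """Parse '/proc/net/snmp'-style text into {'IcmpInMsgs': value, ...}."""
    counters = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: "Proto: Name1 Name2 ..." then "Proto: v1 v2 ..."
    for header, values in zip(lines[::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, vals = values.split(":", 1)
        for name, val in zip(names.split(), vals.split()):
            counters[proto + name] = int(val)   # nstat joins proto and name
    return counters

def diff(before, after):
    """Return only the counters that changed, like 'nstat' after 'nstat -n'."""
    return {k: after[k] - before[k] for k in after if after[k] != before.get(k, 0)}

before = parse_snmp("Icmp: InMsgs InEchoReps OutMsgs\nIcmp: 10 4 10")
after  = parse_snmp("Icmp: InMsgs InEchoReps OutMsgs\nIcmp: 11 5 11")
print(diff(before, after))  # {'IcmpInMsgs': 1, 'IcmpInEchoReps': 1, 'IcmpOutMsgs': 1}
```

The real files carry many more columns per protocol, but the parsing shape is the same.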
+
+icmp ping
+=========
+
+Run the ping command against the public dns server 8.8.8.8::
+
+  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
+  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
+  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
+
+  --- 8.8.8.8 ping statistics ---
+  1 packets transmitted, 1 received, 0% packet loss, time 0ms
+  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
+
+The nstat result::
+
+  nstatuser@nstat-a:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IpOutRequests                   1                  0.0
+  IcmpInMsgs                      1                  0.0
+  IcmpInEchoReps                  1                  0.0
+  IcmpOutMsgs                     1                  0.0
+  IcmpOutEchos                    1                  0.0
+  IcmpMsgInType0                  1                  0.0
+  IcmpMsgOutType8                 1                  0.0
+  IpExtInOctets                   84                 0.0
+  IpExtOutOctets                  84                 0.0
+  IpExtInNoECTPkts                1                  0.0
+
+The nstat output can be divided into two parts: counters with the
+'Ext' keyword and counters without it. If a counter name doesn't
+contain 'Ext', it is defined by one of the snmp RFCs; if it contains
+'Ext', it is a kernel extension counter. Below we explain them one by
+one.
+
+The rfc defined counters
+------------------------
+
+* IpInReceives
+The total number of input datagrams received from interfaces,
+including those received in error.
+
+https://tools.ietf.org/html/rfc1213#page-26
+
+* IpInDelivers
+The total number of input datagrams successfully delivered to IP
+user-protocols (including ICMP).
+
+https://tools.ietf.org/html/rfc1213#page-28
+
+* IpOutRequests
+The total number of IP datagrams which local IP user-protocols
+(including ICMP) supplied to IP in requests for transmission.  Note
+that this counter does not include any datagrams counted in
+ipForwDatagrams.
+
+https://tools.ietf.org/html/rfc1213#page-28
+
+* IcmpInMsgs
+The total number of ICMP messages which the entity received.  Note
+that this counter includes all those counted by icmpInErrors.
+
+https://tools.ietf.org/html/rfc1213#page-41
+
+* IcmpInEchoReps
+The number of ICMP Echo Reply messages received.
+
+https://tools.ietf.org/html/rfc1213#page-42
+
+* IcmpOutMsgs
+The total number of ICMP messages which this entity attempted to send.
+Note that this counter includes all those counted by icmpOutErrors.
+
+https://tools.ietf.org/html/rfc1213#page-43
+
+* IcmpOutEchos
+The number of ICMP Echo (request) messages sent.
+
+https://tools.ietf.org/html/rfc1213#page-45
+
+IcmpMsgInType0 and IcmpMsgOutType8 are not defined by any snmp related
+RFCs, but their meanings are quite straightforward: they count the
+number of packets of a specific icmp type. The icmp types can be found
+here:
+
+https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
+
+Type 8 is echo, type 0 is echo reply.
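As a concrete illustration (a hypothetical sketch, not kernel code), the echo-request message that ticks IcmpOutEchos and IcmpMsgOutType8 can be built by hand: an 8-byte header from RFC 792 (type 8, code 0) plus the payload, protected by the RFC 1071 internet checksum.

```python
import struct

# Build an ICMP echo request (type 8) with the standard internet checksum.

def internet_checksum(data: bytes) -> int:
    """RFC 1071: one's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def icmp_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    # type=8 (echo), code=0, checksum computed over the whole message
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

msg = icmp_echo_request(0x1234, 1, b"\x00" * 56)
print(len(msg))  # 64: the 8-byte echo header plus ping's default 56 data bytes
```

A valid ICMP message checksums to zero, which is an easy self-check: `internet_checksum(msg) == 0`.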
+
+So far, we can easily explain these items of the nstat output: we
+sent an icmp echo request, so IpOutRequests, IcmpOutMsgs,
+IcmpOutEchos and IcmpMsgOutType8 were each increased by 1. We got an
+icmp echo reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
+IcmpInEchoReps and IcmpMsgInType0 were each increased by 1. The icmp
+echo reply was passed to the icmp layer via the ip layer, so
+IpInDelivers was increased by 1.
+
+Please note, these metrics are not aware of LRO/GRO: e.g.
+IpOutRequests might count 1 packet, but the hardware splits it into 2
+and sends them separately.
+
+IpExtInOctets and IpExtOutOctets
+--------------------------------
+They are linux kernel extensions with no rfc definitions. Please
+note, rfc1213 does define ifInOctets and ifOutOctets, but they are
+different things: ifInOctets and ifOutOctets are packet sizes
+including the mac layer, while IpExtInOctets and IpExtOutOctets are
+ip layer sizes only.
+
+In our example, an ICMP echo request has four parts:
+
+* 14 bytes mac header
+* 20 bytes ip header
+* 8 bytes icmp header
+* 56 bytes data (the default payload size of the ping command)
+
+So the IpExtInOctets value is 20+8+56=84. The IpExtOutOctets is similar.
+
+IpExtInNoECTPkts
+----------------
+We can find IpExtInNoECTPkts in the nstat output, but the kernel
+provides four similar counters, so we explain them together:
+
+* IpExtInNoECTPkts
+* IpExtInECT1Pkts
+* IpExtInECT0Pkts
+* IpExtInCEPkts
+
+They indicate four kinds of ECN IP packets, which are defined here:
+
+https://tools.ietf.org/html/rfc3168#page-6
+
+These 4 counters count how many packets were received per ECN
+status. They count the real frame number regardless of LRO/GRO. So
+for the same traffic, you might find that IpInReceives counts 1, but
+IpExtInNoECTPkts counts 2 or more.
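The mapping is mechanical: RFC 3168 puts the ECN codepoint in the two low-order bits of the IP TOS/traffic-class byte, and that codepoint decides which of the four counters ticks. A small sketch:

```python
# RFC 3168: the two low-order bits of the IP TOS byte carry the ECN
# codepoint, which selects one of the four IpExtIn*Pkts counters.

ECN_COUNTER = {
    0b00: "IpExtInNoECTPkts",  # Not ECN-Capable Transport
    0b01: "IpExtInECT1Pkts",   # ECT(1)
    0b10: "IpExtInECT0Pkts",   # ECT(0)
    0b11: "IpExtInCEPkts",     # Congestion Experienced
}

def ecn_counter_for(tos: int) -> str:
    return ECN_COUNTER[tos & 0b11]

print(ecn_counter_for(0x00))  # IpExtInNoECTPkts (as in our ping test)
print(ecn_counter_for(0x03))  # IpExtInCEPkts
```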
+
+additional explanation
+----------------------
+The ip layer counters are recorded by the ip layer code in the
+kernel: if a packet is sent to a lower layer directly, the kernel
+won't record it. For example, tcpreplay opens an AF_PACKET socket and
+sends the packet to layer 2; although it could send an IP packet, you
+can't find it in the nstat output. Here is an example:
+
+We capture the ping packet by tcpdump::
+
+  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/ping.pcap dst 8.8.8.8
+
+Then run ping command::
+
+  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
+
+Terminate tcpdump by Ctrl-C, and run 'nstat -n' to update the nstat
+history. Then run tcpreplay::
+
+  nstatuser@nstat-a:~$ sudo tcpreplay --intf1=ens3 /tmp/ping.pcap
+  Actual: 1 packets (98 bytes) sent in 0.000278 seconds
+  Rated: 352517.9 Bps, 2.82 Mbps, 3597.12 pps
+  Flows: 1 flows, 3597.12 fps, 1 flow packets, 0 non-flow
+  Statistics for network device: ens3
+          Successful packets:        1
+          Failed packets:            0
+          Truncated packets:         0
+          Retried packets (ENOBUFS): 0
+          Retried packets (EAGAIN):  0
+
+Check the nstat output::
+
+  nstatuser@nstat-a:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IcmpInMsgs                      1                  0.0
+  IcmpInEchoReps                  1                  0.0
+  IcmpMsgInType0                  1                  0.0
+  IpExtInOctets                   84                 0.0
+  IpExtInNoECTPkts                1                  0.0
+
+We can see that nstat only shows the received packet, because the IP
+layer of the kernel only knows about the reply from 8.8.8.8; it
+doesn't know what tcpreplay sent.
+
+On the other hand, when you use an AF_INET socket, even with the
+SOCK_RAW option, the IP layer will still check whether the packet is
+an ICMP packet; if it is, the kernel will still count it in its
+counters and you can find it in the output of nstat.
+
+tcp 3 way handshake
+===================
+
+On server side, we run::
+
+  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
+  Listening on [0.0.0.0] (family 0, port 9000)
+
+On client side, we run::
+
+  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
+  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
+
+The server listened on tcp 9000 port, the client connected to it, they
+completed the 3-way handshake.
+
+On server side, we can find below nstat output::
+
+  nstatuser@nstat-b:~$ nstat | grep -i tcp
+  TcpPassiveOpens                 1                  0.0
+  TcpInSegs                       2                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPPureAcks               1                  0.0
+
+On client side, we can find below nstat output::
+
+  nstatuser@nstat-a:~$ nstat | grep -i tcp
+  TcpActiveOpens                  1                  0.0
+  TcpInSegs                       1                  0.0
+  TcpOutSegs                      2                  0.0
+
+Except for TcpExtTCPPureAcks, all the other counters are defined by rfc1213.
+
+* TcpActiveOpens
+The number of times TCP connections have made a direct transition to
+the SYN-SENT state from the CLOSED state.
+
+https://tools.ietf.org/html/rfc1213#page-47
+
+* TcpPassiveOpens
+The number of times TCP connections have made a direct transition to
+the SYN-RCVD state from the LISTEN state.
+
+https://tools.ietf.org/html/rfc1213#page-47
+
+* TcpInSegs
+The total number of segments received, including those received in
+error.  This count includes segments received on currently established
+connections.
+
+https://tools.ietf.org/html/rfc1213#page-48
+
+* TcpOutSegs
+The total number of segments sent, including those on current
+connections but excluding those containing only retransmitted octets.
+
+https://tools.ietf.org/html/rfc1213#page-48
+
+
+TcpExtTCPPureAcks is a linux kernel extension. When the kernel
+receives a TCP packet with the ACK flag set and no data, either
+TcpExtTCPPureAcks or TcpExtTCPHPAcks will increase by 1. We will
+discuss the difference in a later section.
+
+Now we can easily explain the nstat outputs on the server side and client
+side.
+
+When the server received the first syn, it replied with a syn+ack and
+came into the SYN-RCVD state, so TcpPassiveOpens increased by 1. The
+server received a syn, sent a syn+ack and received an ack, so the
+server sent 1 packet and received 2 packets: TcpInSegs increased by
+2, TcpOutSegs increased by 1. The last ack of the 3-way handshake is
+a pure ack without data, so TcpExtTCPPureAcks increased by 1.
+
+When the client sent the syn, it came into the SYN-SENT state, so
+TcpActiveOpens increased by 1. The client sent a syn, received a
+syn+ack and sent an ack, so the client sent 2 packets and received 1
+packet: TcpInSegs increased by 1, TcpOutSegs increased by 2.
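The counter movements above can be replayed as a toy model (not kernel code): tick the same counters for each step of the handshake and check that they match the two nstat outputs.

```python
from collections import Counter

# Toy model of the 3-way handshake counters shown in the nstat outputs.

def handshake():
    server, client = Counter(), Counter()
    # client: CLOSED -> SYN-SENT, sends syn
    client.update(["TcpActiveOpens", "TcpOutSegs"])
    server.update(["TcpInSegs"])
    # server: LISTEN -> SYN-RCVD, replies syn+ack
    server.update(["TcpPassiveOpens", "TcpOutSegs"])
    client.update(["TcpInSegs"])
    # client sends the final ack: a pure ack, no data
    client.update(["TcpOutSegs"])
    server.update(["TcpInSegs", "TcpExtTCPPureAcks"])
    return server, client

server, client = handshake()
print(dict(server))  # matches the server nstat: InSegs=2, OutSegs=1, ...
print(dict(client))  # matches the client nstat: InSegs=1, OutSegs=2, ...
```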
+
+Note: regarding TcpInSegs and TcpOutSegs, rfc1213 doesn't define the
+behavior when gso/gro/tso are enabled on the NIC (network interface
+card). In the current linux implementation, TcpOutSegs is aware of
+gso/tso, but TcpInSegs is not aware of gro. So TcpOutSegs counts the
+actual packet number even when only 1 packet is sent via the tcp
+layer, while if multiple packets arrive at a NIC and are merged into
+1 packet, TcpInSegs will only count 1.
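The gso/tso effect on TcpOutSegs is just segmentation arithmetic: one large write from the tcp layer becomes ceil(payload / mss) wire segments. A quick sketch, using the mss of 1448 that the 'ss' output later in this document shows:

```python
import math

# One tcp-layer send of `payload_bytes` becomes this many wire segments
# after gso/tso, and TcpOutSegs counts the post-segmentation number.

def wire_segments(payload_bytes: int, mss: int = 1448) -> int:
    return max(1, math.ceil(payload_bytes / mss))

print(wire_segments(64 * 1024))  # a 64 KiB write -> 46 segments with mss 1448
print(wire_segments(100))        # a small write stays 1 segment
```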
+
+tcp disconnect
+==============
+
+Continue our previous example, on the server side, we have run::
+
+  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
+  Listening on [0.0.0.0] (family 0, port 9000)
+
+On client side, we have run::
+
+  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
+  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
+
+Now we type Ctrl-C on the client side to close the tcp connection
+between the two nc commands. Then we check the nstat output.
+
+On server side::
+
+  nstatuser@nstat-b:~$ nstat | grep -i tcp
+  TcpInSegs                       2                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPPureAcks               1                  0.0
+  TcpExtTCPOrigDataSent           1                  0.0
+
+On client side::
+
+  nstatuser@nstat-a:~$ nstat | grep -i tcp
+  TcpInSegs                       2                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPPureAcks               1                  0.0
+  TcpExtTCPOrigDataSent           1                  0.0
+
+Wait for more than 1 minute, run nstat on client again::
+
+  nstatuser@nstat-a:~$ nstat | grep -i tcp
+  TcpExtTW                        1                  0.0
+
+Most of the counters are explained in the previous section except
+two: TcpExtTCPOrigDataSent and TcpExtTW. Both of them are linux kernel
+extensions.
+
+TcpExtTW means a tcp connection is closed normally via
+time wait stage, not via tcp reuse process.
+
+About TcpExtTCPOrigDataSent, the kernel patch below has a good explanation:
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
+
+I pasted it here::
+
+  TCPOrigDataSent: number of outgoing packets with original data
+  (excluding retransmission but including data-in-SYN). This counter is
+  different from TcpOutSegs because TcpOutSegs also tracks pure
+  ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate.
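As the patch comment suggests, TCPOrigDataSent is a better denominator for the retransmission rate than TcpOutSegs. A quick calculation using the client-side numbers from the iperf test in the next section (TcpRetransSegs=3876, TcpExtTCPOrigDataSent=214038):

```python
# Retransmission rate: retransmitted segments over original data segments.

def retrans_rate(retrans_segs: int, orig_data_sent: int) -> float:
    return retrans_segs / orig_data_sent

rate = retrans_rate(3876, 214038)
print(round(100 * rate, 2))  # 1.81 (percent) for the iperf run below
```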
+
+the effect of gso and gro
+=========================
+
+Generic Segmentation Offload (GSO) and Generic Receive Offload (GRO)
+affect the packet in/out metrics on both the ip and tcp layers. Here
+is an iperf example. Before the test, run the command below to make
+sure both gso and gro are enabled on the NIC::
+
+  $ sudo ethtool -k ens3 | egrep '(generic-segmentation-offload|generic-receive-offload)'
+  generic-segmentation-offload: on
+  generic-receive-offload: on
+
+On server side, run::
+
+  iperf3 -s -p 9000
+
+On client side, run::
+
+  iperf3 -c 192.168.122.251 -p 9000 -t 5 -P 10
+
+The server listened on tcp port 9000; the client connected to the
+server with 10 parallel streams and ran for 5 seconds. After iperf3
+stopped, we ran nstat on both the server and the client.
+
+On server side::
+
+  nstatuser@nstat-b:~$ nstat
+  #kernel
+  IpInReceives                    36346              0.0
+  IpInDelivers                    36346              0.0
+  IpOutRequests                   33836              0.0
+  TcpPassiveOpens                 11                 0.0
+  TcpEstabResets                  2                  0.0
+  TcpInSegs                       36346              0.0
+  TcpOutSegs                      33836              0.0
+  TcpOutRsts                      20                 0.0
+  TcpExtDelayedACKs               26                 0.0
+  TcpExtTCPHPHits                 32120              0.0
+  TcpExtTCPPureAcks               16                 0.0
+  TcpExtTCPHPAcks                 5                  0.0
+  TcpExtTCPAbortOnData            5                  0.0
+  TcpExtTCPAbortOnClose           2                  0.0
+  TcpExtTCPRcvCoalesce            7306               0.0
+  TcpExtTCPOFOQueue               1354               0.0
+  TcpExtTCPOrigDataSent           15                 0.0
+  IpExtInOctets                   311732432          0.0
+  IpExtOutOctets                  1785119            0.0
+  IpExtInNoECTPkts                214032             0.0
+
+Client side::
+
+  nstatuser@nstat-a:~$ nstat
+  #kernel
+  IpInReceives                    33836              0.0
+  IpInDelivers                    33836              0.0
+  IpOutRequests                   43786              0.0
+  TcpActiveOpens                  11                 0.0
+  TcpEstabResets                  10                 0.0
+  TcpInSegs                       33836              0.0
+  TcpOutSegs                      214072             0.0
+  TcpRetransSegs                  3876               0.0
+  TcpExtDelayedACKs               7                  0.0
+  TcpExtTCPHPHits                 5                  0.0
+  TcpExtTCPPureAcks               2719               0.0
+  TcpExtTCPHPAcks                 31071              0.0
+  TcpExtTCPSackRecovery           607                0.0
+  TcpExtTCPSACKReorder            61                 0.0
+  TcpExtTCPLostRetransmit         90                 0.0
+  TcpExtTCPFastRetrans            3806               0.0
+  TcpExtTCPSlowStartRetrans       62                 0.0
+  TcpExtTCPLossProbes             38                 0.0
+  TcpExtTCPSackRecoveryFail       8                  0.0
+  TcpExtTCPSackShifted            203                0.0
+  TcpExtTCPSackMerged             778                0.0
+  TcpExtTCPSackShiftFallback      700                0.0
+  TcpExtTCPSpuriousRtxHostQueues  4                  0.0
+  TcpExtTCPAutoCorking            14                 0.0
+  TcpExtTCPOrigDataSent           214038             0.0
+  TcpExtTCPHystartTrainDetect     8                  0.0
+  TcpExtTCPHystartTrainCwnd       172                0.0
+  IpExtInOctets                   1785227            0.0
+  IpExtOutOctets                  317789680          0.0
+  IpExtInNoECTPkts                33836              0.0
+
+The TcpOutSegs and IpOutRequests on the server are 33836, exactly the
+same as IpExtInNoECTPkts, IpInReceives, IpInDelivers and TcpInSegs on
+the client side. During the iperf3 test, the server only replied with
+very short packets, so gso and gro have no effect on the server's
+replies.
+
+On the client side, TcpOutSegs is 214072 and IpOutRequests is 43786:
+the tcp layer packet count is larger than the ip layer packet count,
+because TcpOutSegs counts the packet number after gso, but
+IpOutRequests doesn't. On the server side, IpExtInNoECTPkts is
+214032, slightly smaller than the TcpOutSegs on the client side
+(214072), probably due to packet loss. The IpInReceives, IpInDelivers
+and TcpInSegs are obviously smaller than the TcpOutSegs on the client
+side, because these counters count the packets after gro.
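The paragraph above can be turned into a small sanity check: compare what the client's tcp layer put on the wire (post-gso TcpOutSegs) with the real frames the server saw (pre-gro IpExtInNoECTPkts), and with what gro delivered (TcpInSegs).

```python
# Counter cross-check using the numbers from the iperf test above.

client_tcp_out_segs = 214072   # client TcpOutSegs, post-gso wire segments
server_in_frames    = 214032   # server IpExtInNoECTPkts, pre-gro frames
server_tcp_in_segs  = 36346    # server TcpInSegs, post-gro segments

lost = client_tcp_out_segs - server_in_frames
merged_ratio = server_in_frames / server_tcp_in_segs

print(lost)                    # 40 frames unaccounted for (likely dropped)
print(round(merged_ratio, 1))  # gro merged ~5.9 frames per delivered segment
```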
+
+tcp counters in established state
+=================================
+
+Run nc on server::
+
+  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
+  Listening on [0.0.0.0] (family 0, port 9000)
+
+Run nc on client::
+
+  nstatuser@nstat-a:~$ nc -v nstat-b 9000
+  Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+Input a string in the nc client ('hello' in our example)::
+
+  nstatuser@nstat-a:~$ nc -v nstat-b 9000
+  Connection to nstat-b 9000 port [tcp/*] succeeded!
+  hello
+
+The client side nstat output::
+
+  nstatuser@nstat-a:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IpOutRequests                   1                  0.0
+  TcpInSegs                       1                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPPureAcks               1                  0.0
+  TcpExtTCPOrigDataSent           1                  0.0
+  IpExtInOctets                   52                 0.0
+  IpExtOutOctets                  58                 0.0
+  IpExtInNoECTPkts                1                  0.0
+
+The server side nstat output::
+
+  nstatuser@nstat-b:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IpOutRequests                   1                  0.0
+  TcpInSegs                       1                  0.0
+  TcpOutSegs                      1                  0.0
+  IpExtInOctets                   58                 0.0
+  IpExtOutOctets                  52                 0.0
+  IpExtInNoECTPkts                1                  0.0
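Where do the 58 and 52 octet counts come from? A sketch of the arithmetic, assuming the tcp timestamps option is in use (the 'ss' output later in this document shows 'ts'), which pads the tcp header from 20 to 32 bytes:

```python
# Octet accounting for the 'hello' exchange: IpExt*Octets counts only
# the ip layer and above (no mac header).

IP_HEADER  = 20
TCP_HEADER = 20 + 12   # base header + timestamps option (10 bytes + padding)

data_packet = IP_HEADER + TCP_HEADER + len(b"hello\n")  # nc appends a newline
pure_ack    = IP_HEADER + TCP_HEADER

print(data_packet)  # 58 = IpExtOutOctets on the client
print(pure_ack)     # 52 = IpExtInOctets on the client
```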
+
+Input a string in the nc client again ('world' in our example)::
+
+  nstatuser@nstat-a:~$ nc -v nstat-b 9000
+  Connection to nstat-b 9000 port [tcp/*] succeeded!
+  hello
+  world
+
+Client side nstat output::
+
+  nstatuser@nstat-a:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IpOutRequests                   1                  0.0
+  TcpInSegs                       1                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPHPAcks                 1                  0.0
+  TcpExtTCPOrigDataSent           1                  0.0
+  IpExtInOctets                   52                 0.0
+  IpExtOutOctets                  58                 0.0
+  IpExtInNoECTPkts                1                  0.0
+
+
+Server side nstat output::
+
+  nstatuser@nstat-b:~$ nstat
+  #kernel
+  IpInReceives                    1                  0.0
+  IpInDelivers                    1                  0.0
+  IpOutRequests                   1                  0.0
+  TcpInSegs                       1                  0.0
+  TcpOutSegs                      1                  0.0
+  TcpExtTCPHPHits                 1                  0.0
+  IpExtInOctets                   58                 0.0
+  IpExtOutOctets                  52                 0.0
+  IpExtInNoECTPkts                1                  0.0
+
+Comparing the first client side output with the second client side
+output, we find one difference: the first one had a
+'TcpExtTCPPureAcks', but the second one had a 'TcpExtTCPHPAcks'. The
+first server side output and the second server side output differ
+too: the second server side output had a TcpExtTCPHPHits, but the
+first server side output didn't have it. The network traffic patterns
+were exactly the same: the client sent a packet to the server, and
+the server replied with an ack. But the kernel handled them in
+different ways. When the kernel receives a tcp packet in the
+established state, it has two paths to handle the packet: the fast
+path and the slow path. The comment in the kernel code provides a
+good explanation of them; it is quoted below::
+
+  It is split into a fast path and a slow path. The fast path is
+  disabled when:
+  - A zero window was announced from us - zero window probing
+    is only handled properly on the slow path.
+  - Out of order segments arrived.
+  - Urgent data is expected.
+  - There is no buffer space left
+  - Unexpected TCP flags/window values/header lengths are received
+    (detected by checking the TCP header against pred_flags)
+  - Data is sent in both directions. The fast path only supports pure senders
+    or pure receivers (this means either the sequence number or the ack
+    value must stay constant)
+  - Unexpected TCP option.
+
+The kernel tries to use the fast path unless any of the above
+conditions is satisfied. If the packets are out of order, the kernel
+will handle them on the slow path, which means the performance might
+not be very good. The kernel would also take the slow path if
+"Delayed ack" is used, because with "Delayed ack" the data is sent in
+both directions. When the tcp window scale option is not used, the
+kernel will try to enable the fast path immediately when the
+connection comes into the established state; but if the tcp window
+scale option is used, the kernel will disable the fast path at first
+and try to enable it after receiving packets. We can use the 'ss'
+command to verify whether the window scale option is used, e.g. by
+running the command below on either the server or the client::
+
+  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
+  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port     
+  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000     
+             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
+
+The 'wscale:7,7' means both server and client set the window scale
+option to 7. Now we could explain the nstat output in our test:
+
+In the first nstat output on the client side, the client sent a
+packet and the server replied with an ack; when the kernel handled
+this ack, the fast path was not enabled, so the ack was counted into
+'TcpExtTCPPureAcks'. In the second nstat output on the client side,
+the client sent a packet again and received another ack from the
+server; this time the fast path was enabled and the ack qualified for
+it, so it was handled by the fast path and counted into
+'TcpExtTCPHPAcks'. In the first nstat output on the server side, the
+fast path was not enabled, so there was no 'TcpExtTCPHPHits'. In the
+second nstat output on the server side, the fast path was enabled and
+the packet received from the client qualified for it, so it was
+counted into 'TcpExtTCPHPHits'.
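Pulling the window-scale values out of the 'ss -i' detail line shown above is a one-line regex; a small sketch:

```python
import re

# Extract the negotiated window-scale values from an 'ss -i' detail line.

def parse_wscale(ss_line: str):
    m = re.search(r"wscale:(\d+),(\d+)", ss_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 "
        "pmtu:1500 cwnd:10")
print(parse_wscale(line))  # (7, 7): both sides scale their window by 2^7
```

If the option was not negotiated, 'ss' prints no wscale field at all and the function returns None.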
+
+tcp abort
+=========
+
+Some counters indicate the reasons why the tcp layer wants to send a
+rst; they are:
+
+* TcpExtTCPAbortOnData
+* TcpExtTCPAbortOnClose
+* TcpExtTCPAbortOnMemory
+* TcpExtTCPAbortOnTimeout
+* TcpExtTCPAbortOnLinger
+* TcpExtTCPAbortFailed
+
+TcpExtTCPAbortOnData
+--------------------
+
+It means the tcp layer has data in flight but needs to close the
+connection, so the tcp layer sends a rst to the other side,
+indicating that the connection was not closed gracefully. An easy way
+to increase this counter is to use the SO_LINGER option. Please refer
+to the SO_LINGER section of the socket man page:
+
+http://man7.org/linux/man-pages/man7/socket.7.html
+
+By default, when an application closes a connection, the close
+function returns immediately and the kernel tries to send the
+in-flight data asynchronously. If you use the SO_LINGER option, set
+l_onoff to 1 and l_linger to a positive number, the close function
+won't return immediately but waits until the in-flight data is acked
+by the other side; the maximum wait time is l_linger seconds. If you
+set l_onoff to 1 and l_linger to 0, when the application closes a
+connection, the kernel sends a rst immediately and increases the
+TcpExtTCPAbortOnData counter by 1.
+
+We run nc on the server side::
+
+  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
+  Listening on [0.0.0.0] (family 0, port 9000)
+
+Run below python code on the client side::
+
+  import socket
+  import struct
+    
+  server = 'nstat-b' # server address
+  port = 9000
+    
+  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
+  s.connect((server, port))
+  s.close()
+
+On client side, we could see TcpExtTCPAbortOnData increased::
+
+  nstatuser@nstat-a:~$ nstat | grep -i abort
+  TcpExtTCPAbortOnData            1                  0.0
+
+If we capture packets with tcpdump, we can see that the client sends
+a rst instead of a fin.
+
+
+TcpExtTCPAbortOnClose
+---------------------
+
+This counter means the tcp layer has unread data when the application
+wants to close a connection.
+
+On the server side, we run the python script below::
+
+  import socket
+  import time
+  
+  port = 9000
+  
+  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  s.bind(('0.0.0.0', port))
+  s.listen(1)
+  sock, addr = s.accept()
+  while True:
+      time.sleep(9999999)
+
+This python script listens on port 9000 but doesn't read anything
+from the connection.
+
+On the client side, we send the string "hello" by nc::
+
+  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
+
+Then we come back to the server side. The server has received the
+"hello" packet and the tcp layer has acked it, but the application
+hasn't read it yet. We type Ctrl-C to terminate the server script.
+Then we find that TcpExtTCPAbortOnClose increased by 1 on the server
+side::
+
+  nstatuser@nstat-b:~$ nstat | grep -i abort
+  TcpExtTCPAbortOnClose           1                  0.0
+
+If we run tcpdump on the server side, we can see that the server sent
+a rst after we typed Ctrl-C.
+
+TcpExtTCPAbortOnMemory
+----------------------
+
+When an application closes a tcp connection, the kernel still needs
+to track the connection to let it complete the tcp disconnect
+process. E.g. when an app calls the close method of a socket, the
+kernel sends a fin to the other side of the connection; after that,
+the app has no relationship with the socket any more, but the kernel
+needs to keep it. Such a socket becomes an orphan socket: the kernel
+waits for the reply of the other side, and finally the socket comes
+to the TIME_WAIT state. When the kernel doesn't have enough memory to
+keep the orphan socket, it sends a rst to the other side, deletes the
+socket, and increases the TcpExtTCPAbortOnMemory counter by 1. Two
+conditions trigger TcpExtTCPAbortOnMemory:
+
+* the memory used by the tcp protocol is higher than the third value
+  of tcp_mem. Please refer to the tcp_mem section in the tcp man page:
+
+http://man7.org/linux/man-pages/man7/tcp.7.html
+
+* the orphan socket count is higher than net.ipv4.tcp_max_orphans
+
+Below is an example which makes the orphan socket count higher than
+net.ipv4.tcp_max_orphans.
+
+Change tcp_max_orphans to a smaller value on client::
+
+  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
+
+Client code (creates 64 connections to the server)::
+
+  nstatuser@nstat-a:~$ cat client_orphan.py
+  import socket
+  import time
+  
+  server = 'nstat-b' # server address
+  port = 9000
+  
+  count = 64
+  
+  connection_list = []
+  
+  for i in range(count):
+      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+      s.connect((server, port))
+      connection_list.append(s)
+      print("connection_count: %d" % len(connection_list))
+  
+  while True:
+      time.sleep(99999)
+
+Server code (accepts 64 connections from the client)::
+
+  nstatuser@nstat-b:~$ cat server_orphan.py
+  import socket
+  import time
+  
+  port = 9000
+  count = 64
+  
+  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  s.bind(('0.0.0.0', port))
+  s.listen(count)
+  connection_list = []
+  while True:
+      sock, addr = s.accept()
+      connection_list.append((sock, addr))
+      print("connection_count: %d" % len(connection_list))
+
+Run the python scripts on server and client.
+
+On server::
+
+  python3 server_orphan.py
+
+On client::
+
+  python3 client_orphan.py
+
+Run iptables on server::
+
+  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
+
+Type Ctrl-C on client, stop client_orphan.py.
+
+Check TcpExtTCPAbortOnMemory on client::
+
+  nstatuser@nstat-a:~$ nstat | grep -i abort
+  TcpExtTCPAbortOnMemory          54                 0.0
+
+Check the orphan socket count on the client::
+
+  nstatuser@nstat-a:~$ ss -s
+  Total: 131 (kernel 0)
+  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
+  
+  Transport Total     IP        IPv6
+  *         0         -         -
+  RAW       1         0         1
+  UDP       1         1         0
+  TCP       14        13        1
+  INET      16        14        2
+  FRAG      0         0         0
+
+Explanation of the test: after running server_orphan.py and
+client_orphan.py, we set up 64 connections between the server and the
+client. After running the iptables command, the server drops all
+packets from the client. When we type Ctrl-C on client_orphan.py, the
+client system tries to close these connections, and before they are
+closed gracefully, they become orphan sockets. As the iptables rule
+on the server blocks packets from the client, the server won't
+receive a fin from the client, so all connections on the client are
+stuck in the FIN_WAIT_1 state and kept as orphan sockets until they
+time out. We have written 10 to /proc/sys/net/ipv4/tcp_max_orphans,
+so the client system only keeps 10 orphan sockets; for all the other
+orphan sockets, the client system sends a rst and deletes them. We
+have 64 connections, so the 'ss -s' command shows the system has 10
+orphan sockets, and the value of TcpExtTCPAbortOnMemory is 54.
+
+An additional note about the orphan socket count: you can find the
+exact orphan socket count with the 'ss -s' command, but when the
+kernel decides whether to increase TcpExtTCPAbortOnMemory and send a
+rst, it doesn't always check the exact orphan socket count. For
+performance, the kernel checks an approximate count first; only if
+the approximate count exceeds tcp_max_orphans does the kernel check
+the exact count. So if the approximate count is below tcp_max_orphans
+but the exact count is above it, you will find that
+TcpExtTCPAbortOnMemory is not increased at all. If tcp_max_orphans is
+large enough, this won't happen, but if you decrease tcp_max_orphans
+to a small value as in our test, you might hit this issue. That is
+why, in our test, the client set up 64 connections although
+tcp_max_orphans is 10: if the client had only set up 11 connections,
+we couldn't have observed the change of TcpExtTCPAbortOnMemory.
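The two-stage check described above can be sketched in Python (the function and parameter names are illustrative, not the kernel's; the real logic lives in the kernel's tcp code):

```python
def too_many_orphans(approx_count, exact_count, max_orphans):
    """Sketch of the kernel's two-stage orphan-limit check.

    The cheap approximate counter is consulted first; the expensive
    exact count (a callable here) is only taken when the approximation
    is already over the limit.
    """
    if approx_count <= max_orphans:
        # Fast path: the approximation says we are under the limit, so
        # no rst is sent even if the exact count is actually over it.
        return False
    # Slow path: verify with the exact count before aborting.
    return exact_count() > max_orphans

# The anomaly described above: approximate count under the limit,
# exact count over it -> TcpExtTCPAbortOnMemory is not increased.
print(too_many_orphans(8, lambda: 12, 10))   # False
print(too_many_orphans(64, lambda: 64, 10))  # True
```

This is why the test uses 64 connections rather than 11: with a large overshoot, the approximate count is also over the limit, so the slow-path check runs and the aborts are counted.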
+
+TcpExtTCPAbortOnTimeout
+-----------------------
+This counter increases when any of the tcp timers expires. In this
+situation, the kernel doesn't send a rst, it just gives up the
+connection. Continuing the previous test: because the iptables rule
+on the server blocks the traffic, the server won't receive the fin,
+and all the client's orphan sockets eventually time out in the
+FIN_WAIT_1 state. So after waiting for a few minutes, we can find 10
+timeouts on the client::
+
+  nstatuser@nstat-a:~$ nstat | grep -i abort
+  TcpExtTCPAbortOnTimeout         10                 0.0
+
+TcpExtTCPAbortOnLinger
+----------------------
+When a tcp connection comes into the FIN_WAIT_2 state, instead of
+waiting for the fin packet from the other side, the kernel can send a
+rst and delete the socket immediately. This is not the default
+behavior of the linux kernel tcp stack, but by setting socket options
+you can make the kernel follow this behavior. Below is an example.
+
+The server side code::
+
+  nstatuser@nstat-b:~$ cat server_linger.py
+  import socket
+  import time
+  
+  port = 9000
+  
+  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  s.bind(('0.0.0.0', port))
+  s.listen(1)
+  sock, addr = s.accept()
+  while True:
+      time.sleep(9999999)
+
+The client side code::
+
+  nstatuser@nstat-a:~$ cat client_linger.py 
+  import socket
+  import struct
+  
+  server = 'nstat-b' # server address
+  port = 9000
+  
+  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
+  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))                   
+  s.connect((server, port))
+  s.close()
+
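A note on the two setsockopt calls in the client code above: SO_LINGER takes a struct linger (two ints, l_onoff and l_linger), and TCP_LINGER2 takes a single int, where a negative value makes the kernel send a rst instead of waiting in FIN_WAIT_2 (which is exactly what this test relies on). A quick sketch of how those byte strings are built:

```python
import struct

# struct linger { int l_onoff; int l_linger; }  -> format 'ii'
so_linger = struct.pack('ii', 1, 10)   # l_onoff=1 (enabled), l_linger=10 s
onoff, linger_time = struct.unpack('ii', so_linger)
print(onoff, linger_time)              # 1 10

# TCP_LINGER2 takes a single int; a negative value disables the
# FIN_WAIT_2 wait so the kernel sends a rst on close.
tcp_linger2 = struct.pack('i', -1)
(fin_wait2_timeout,) = struct.unpack('i', tcp_linger2)
print(fin_wait2_timeout)               # -1
```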
+Run server_linger.py on server::
+
+  nstatuser@nstat-b:~$ python3 server_linger.py 
+
+Run client_linger.py on client::
+
+  nstatuser@nstat-a:~$ python3 client_linger.py
+
+After running client_linger.py, check the output of nstat::
+
+  nstatuser@nstat-a:~$ nstat | grep -i abort
+  TcpExtTCPAbortOnLinger          1                  0.0
+
+TcpExtTCPAbortFailed
+--------------------
+The kernel tcp layer will send a rst if the conditions of RFC 2525
+section 2.17 are satisfied:
+
+https://tools.ietf.org/html/rfc2525#page-50
+
+If an internal error occurs during this process, TcpExtTCPAbortFailed
+will be increased.
+
+TcpExtListenOverflows and TcpExtListenDrops
+===========================================
+When the kernel receives a syn from a client and the tcp accept queue
+is full, the kernel drops the syn and adds 1 to
+TcpExtListenOverflows. At the same time the kernel also adds 1 to
+TcpExtListenDrops. When a tcp socket is in the LISTEN state and the
+kernel needs to drop a packet, the kernel always adds 1 to
+TcpExtListenDrops. So an increase of TcpExtListenOverflows lets
+TcpExtListenDrops increase at the same time, but TcpExtListenDrops
+can also increase without TcpExtListenOverflows increasing, e.g. a
+memory allocation failure would also let TcpExtListenDrops increase.
+
+Note: the above explanation is based on kernel 4.15 or a later
+version. On an older kernel, the tcp stack behaves differently when
+the tcp accept queue is full: it won't drop the syn, it completes the
+3-way handshake, but as the accept queue is full, the tcp stack keeps
+the socket in the tcp half-open queue. Since the socket is in the
+half-open queue, the tcp stack resends syn+ack on an exponential
+backoff timer. After the client replies with an ack, the tcp stack
+checks whether the accept queue is still full: if it is not full, the
+socket is moved to the accept queue; if it is full, the socket is
+kept in the half-open queue, and the next time the client replies
+with an ack, the socket gets another chance to move to the accept
+queue.
+
+Here is an example:
+
+On server, run the nc command, listen on port 9000::
+
+  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
+  Listening on [0.0.0.0] (family 0, port 9000)
+
+On client, run 3 nc commands in different terminals::
+
+  nstatuser@nstat-a:~$ nc -v nstat-b 9000
+  Connection to nstat-b 9000 port [tcp/*] succeeded!
+
+The nc command only accepts 1 connection, and the accept queue length
+is 1. In the current linux implementation, setting the queue length
+to n means the actual queue length is n+1. Now we create 3
+connections: 1 is accepted by nc, 2 wait in the accept queue, so the
+accept queue is full.
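The "actual queue length is n+1" behavior can be observed locally with a small Python sketch (assuming a linux host; the loopback port is chosen by the OS):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))     # let the OS pick a free port
srv.listen(1)                  # backlog of 1 -> queue actually holds 2
addr = srv.getsockname()

# Both handshakes complete even though accept() is never called,
# because the accept queue holds backlog + 1 connections on linux.
clients = []
for _ in range(2):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(2)
    c.connect(addr)
    clients.append(c)
print(len(clients))            # 2

for c in clients:
    c.close()
srv.close()
```

A third connect attempt would leave its syn subject to the overflow handling described above.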
+
+Before running the 4th nc, we clean the nstat history on the server::
+
+  nstatuser@nstat-b:~$ nstat -n
+
+Run the 4th nc on the client::
+
+  nstatuser@nstat-a:~$ nc -v nstat-b 9000
+
+If the nc server is running on kernel 4.15 or a later version, you
+won't see the "Connection to ... succeeded!" string, because the
+kernel will drop the syn when the accept queue is full. If the nc
+server is running on an older kernel, you will see that the
+connection succeeds, because the kernel completes the 3-way handshake
+and keeps the socket in the half-open queue.
+
+Our test is on kernel 4.15; run nstat on the server::
+
+  nstatuser@nstat-b:~$ nstat
+  #kernel
+  IpInReceives                    4                  0.0
+  IpInDelivers                    4                  0.0
+  TcpInSegs                       4                  0.0
+  TcpExtListenOverflows           4                  0.0
+  TcpExtListenDrops               4                  0.0
+  IpExtInOctets                   240                0.0
+  IpExtInNoECTPkts                4                  0.0
+
+We can see both TcpExtListenOverflows and TcpExtListenDrops are 4.
+The longer the time between the 4th nc and the nstat command, the
+larger the values of TcpExtListenOverflows and TcpExtListenDrops will
+be, because the syn of the 4th nc is dropped and the client keeps
+retrying.
+
-- 
2.17.1


* Re: [PATCH] add an initial version of snmp_counter.rst
  2018-11-09 18:13 [PATCH] add an initial version of snmp_counter.rst yupeng
@ 2018-11-10  2:20 ` Cong Wang
  2018-11-10 16:56 ` Randy Dunlap
  1 sibling, 0 replies; 3+ messages in thread
From: Cong Wang @ 2018-11-10  2:20 UTC (permalink / raw)
  To: peng yu; +Cc: Linux Kernel Network Developers, Randy Dunlap

(Cc Randy)

On Fri, Nov 9, 2018 at 10:13 AM yupeng <yupeng0921@gmail.com> wrote:
>
> The snmp_counter.rst run a set of simple experiments, explains the
> meaning of snmp counters depend on the experiments' results. This is
> an initial version, only covers a small part of the snmp counters.


I haven't looked into the details, so just a few high-level reviews:

1. Please try to group those counters by protocol, it would be easier
to search.

2. For many counters you provide a link to RFC, do you just copy
and paste them? Please try to expand.

3. _I think_ you don't need to show, for example, how to run a ping
command. It's safe to assume readers already know this. Therefore,
just explaining those counters is okay.


Thanks.

>
> Signed-off-by: yupeng <yupeng0921@gmail.com>
> ---
>  Documentation/networking/index.rst        |   1 +
>  Documentation/networking/snmp_counter.rst | 963 ++++++++++++++++++++++
>  2 files changed, 964 insertions(+)
>  create mode 100644 Documentation/networking/snmp_counter.rst
>
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index bd89dae8d578..6a47629ef8ed 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -31,6 +31,7 @@ Contents:
>     net_failover
>     alias
>     bridge
> +   snmp_counter
>
>  .. only::  subproject
>
> diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
> new file mode 100644
> index 000000000000..2939c5acf675
> --- /dev/null
> +++ b/Documentation/networking/snmp_counter.rst
> @@ -0,0 +1,963 @@
> +=====================
> +snmp counter tutorial
> +=====================
> +
> +This document explains the meaning of snmp counters. To make their
> +meaning easier to understand, this document doesn't explain the
> +counters one by one; instead it creates a set of experiments and
> +explains the counters based on the experiments' results. The
> +experiments run on one or two virtual machines. Except for the test
> +commands we use in the experiments, the virtual machines have no
> +other network traffic. We use the 'nstat' command to get the values
> +of the snmp counters. Before every test, we run 'nstat -n' to update
> +the history, so the 'nstat' output only shows the changes of the
> +snmp counters. For more information about nstat, please refer to:
> +
> +http://man7.org/linux/man-pages/man8/nstat.8.html
> +
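As background, the non-'Ext' counters that nstat reports come from /proc/net/snmp, which consists of pairs of lines: a header line of counter names and a line of values, both starting with the protocol name. A minimal parsing sketch (a hypothetical helper, not part of nstat):

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp-style text into {'IpInReceives': 84, ...}.

    The file consists of line pairs: 'Proto: Name1 Name2 ...'
    followed by 'Proto: val1 val2 ...'.
    """
    counters = {}
    lines = [l for l in text.splitlines() if l.strip()]
    for names_line, values_line in zip(lines[::2], lines[1::2]):
        proto, names = names_line.split(':', 1)
        _, values = values_line.split(':', 1)
        for name, value in zip(names.split(), values.split()):
            counters[proto + name] = int(value)
    return counters

sample = "Ip: Forwarding InReceives InDelivers\nIp: 1 84 84\n"
print(parse_proc_net_snmp(sample)['IpInReceives'])  # 84
```

The 'Ext' counters live in /proc/net/netstat, which uses the same line-pair layout.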
> +icmp ping
> +=========
> +
> +Run the ping command against the public dns server 8.8.8.8::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
> +  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
> +
> +  --- 8.8.8.8 ping statistics ---
> +  1 packets transmitted, 1 received, 0% packet loss, time 0ms
> +  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
> +
> +The nstat result::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpOutMsgs                     1                  0.0
> +  IcmpOutEchos                    1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IcmpMsgOutType8                 1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtOutOctets                  84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The nstat output can be divided into two parts: counters with the
> +'Ext' keyword and counters without it. If a counter name doesn't
> +contain 'Ext', it is defined by one of the snmp RFCs; if it contains
> +'Ext', it is a kernel extension counter. Below we explain them one
> +by one.
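The naming rule can be sketched as a tiny classifier (illustrative only):

```python
def counter_origin(name):
    """Classify an nstat counter name per the naming rule above."""
    # Kernel extension counters embed 'Ext' after the protocol prefix,
    # e.g. IpExtInOctets, TcpExtTCPPureAcks.
    return 'kernel extension' if 'Ext' in name else 'snmp rfc'

print(counter_origin('IpInReceives'))    # snmp rfc
print(counter_origin('IpExtInOctets'))   # kernel extension
```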
> +
> +The rfc defined counters
> +------------------------
> +
> +* IpInReceives
> +The total number of input datagrams received from interfaces,
> +including those received in error.
> +
> +https://tools.ietf.org/html/rfc1213#page-26
> +
> +* IpInDelivers
> +The total number of input datagrams successfully delivered to IP
> +user-protocols (including ICMP).
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IpOutRequests
> +The total number of IP datagrams which local IP user-protocols
> +(including ICMP) supplied to IP in requests for transmission.  Note
> +that this counter does not include any datagrams counted in
> +ipForwDatagrams.
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IcmpInMsgs
> +The total number of ICMP messages which the entity received.  Note
> +that this counter includes all those counted by icmpInErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-41
> +
> +* IcmpInEchoReps
> +The number of ICMP Echo Reply messages received.
> +
> +https://tools.ietf.org/html/rfc1213#page-42
> +
> +* IcmpOutMsgs
> +The total number of ICMP messages which this entity attempted to send.
> +Note that this counter includes all those counted by icmpOutErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-43
> +
> +* IcmpOutEchos
> +The number of ICMP Echo (request) messages sent.
> +
> +https://tools.ietf.org/html/rfc1213#page-45
> +
> +IcmpMsgInType0 and IcmpMsgOutType8 are not defined by any snmp
> +related RFCs, but their meaning is quite straightforward: they count
> +the number of packets of specific icmp types. We can find the icmp
> +types here:
> +
> +https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
> +
> +Type 8 is echo, type 0 is echo reply.
> +
> +Now we can easily explain these items of the nstat output: we sent
> +an icmp echo request, so IpOutRequests, IcmpOutMsgs, IcmpOutEchos
> +and IcmpMsgOutType8 increased by 1. We got an icmp echo reply from
> +8.8.8.8, so IpInReceives, IcmpInMsgs, IcmpInEchoReps and
> +IcmpMsgInType0 increased by 1. The icmp echo reply was passed to the
> +icmp layer via the ip layer, so IpInDelivers increased by 1.
> +
> +Please note, these counters are not aware of LRO/GRO; e.g.,
> +IpOutRequests might count 1 packet, but the hardware splits it into
> +2 and sends them separately.
> +
> +IpExtInOctets and IpExtOutOctets
> +--------------------------------
> +They are linux kernel extensions with no rfc definitions. Please
> +note, rfc1213 does define ifInOctets and ifOutOctets, but they are
> +different things: ifInOctets and ifOutOctets count packet sizes
> +including the mac layer, while IpExtInOctets and IpExtOutOctets only
> +count the ip layer sizes.
> +
> +In our example, an ICMP echo request has four parts:
> +
> +* 14 bytes mac header
> +* 20 bytes ip header
> +* 8 bytes icmp header
> +* 56 bytes data (default value of the ping command)
> +
> +So the IpExtInOctets value is 20+8+56=84. IpExtOutOctets is similar.
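The arithmetic can be double-checked; the 98-byte on-wire size also matches the "1 packets (98 bytes)" line that tcpreplay prints later in this document:

```python
mac_header = 14        # counted by ifInOctets/ifOutOctets only
ip_header = 20
icmp_message = 64      # icmp header + payload; ping reports "56(84) bytes"

ip_ext_octets = ip_header + icmp_message   # what IpExtInOctets counts
if_octets = mac_header + ip_ext_octets     # what ifInOctets would count

print(ip_ext_octets)   # 84
print(if_octets)       # 98
```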
> +
> +IpExtInNoECTPkts
> +----------------
> +We can find IpExtInNoECTPkts in the nstat output, but the kernel
> +provides four similar counters; we explain them together here:
> +
> +* IpExtInNoECTPkts
> +* IpExtInECT1Pkts
> +* IpExtInECT0Pkts
> +* IpExtInCEPkts
> +
> +They indicate four kinds of ECN IP packets, which are defined here:
> +
> +https://tools.ietf.org/html/rfc3168#page-6
> +
> +These 4 counters count how many packets were received for each ECN
> +status. They count the real frame number regardless of LRO/GRO. So
> +for the same traffic, you might find that IpInReceives counts 1, but
> +IpExtInNoECTPkts counts 2 or more.
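The four counters map to the two ECN bits of the IP TOS/traffic-class byte; a small sketch of the mapping (codepoint values per RFC 3168):

```python
ECN_COUNTER = {
    0b00: 'IpExtInNoECTPkts',  # Not ECN-Capable Transport
    0b01: 'IpExtInECT1Pkts',   # ECT(1)
    0b10: 'IpExtInECT0Pkts',   # ECT(0)
    0b11: 'IpExtInCEPkts',     # Congestion Experienced
}

def ecn_counter_for(tos):
    """Pick the counter a received frame contributes to, from its TOS byte."""
    return ECN_COUNTER[tos & 0b11]

print(ecn_counter_for(0x00))  # IpExtInNoECTPkts
print(ecn_counter_for(0x02))  # IpExtInECT0Pkts
```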
> +
> +additional notes
> +----------------
> +The ip layer counters are recorded by the ip layer code in the
> +kernel; if a program sends a packet to a lower layer directly, the
> +linux kernel won't record it. For example, tcpreplay opens an
> +AF_PACKET socket and sends the packet to layer 2; although it can
> +send an IP packet, you can't find it in the nstat output. Here is an
> +example:
> +
> +We capture the ping packet by tcpdump::
> +
> +  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/ping.pcap dst 8.8.8.8
> +
> +Then run ping command::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +
> +Terminate tcpdump by Ctrl-C, and run 'nstat -n' to update the nstat
> +history. Then run tcpreplay::
> +
> +  nstatuser@nstat-a:~$ sudo tcpreplay --intf1=ens3 /tmp/ping.pcap
> +  Actual: 1 packets (98 bytes) sent in 0.000278 seconds
> +  Rated: 352517.9 Bps, 2.82 Mbps, 3597.12 pps
> +  Flows: 1 flows, 3597.12 fps, 1 flow packets, 0 non-flow
> +  Statistics for network device: ens3
> +          Successful packets:        1
> +          Failed packets:            0
> +          Truncated packets:         0
> +          Retried packets (ENOBUFS): 0
> +          Retried packets (EAGAIN):  0
> +
> +Check the nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +We can see that nstat only shows the received packets, because the
> +IP layer of the kernel only saw the reply from 8.8.8.8; it didn't
> +see what tcpreplay sent.
> +
> +At the same time, when you use an AF_INET socket, even if you use
> +the SOCK_RAW option, the IP layer will still check whether the
> +packet is an ICMP packet; if it is, the kernel will still count it
> +in its counters and you can find it in the output of nstat.
> +
> +tcp 3 way handshake
> +===================
> +
> +On server side, we run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +The server listened on tcp 9000 port, the client connected to it, they
> +completed the 3-way handshake.
> +
> +On server side, we can find below nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpPassiveOpens                 1                  0.0
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +
> +On client side, we can find below nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpActiveOpens                  1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      2                  0.0
> +
> +Except for TcpExtTCPPureAcks, all the other counters are defined by
> +rfc1213.
> +
> +* TcpActiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-SENT state from the CLOSED state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpPassiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-RCVD state from the LISTEN state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpInSegs
> +The total number of segments received, including those received in
> +error.  This count includes segments received on currently established
> +connections.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +* TcpOutSegs
> +The total number of segments sent, including those on current
> +connections but excluding those containing only retransmitted octets.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +
> +TcpExtTCPPureAcks is a linux kernel extension. When the kernel
> +receives a TCP packet which has the ACK flag set and carries no
> +data, either TcpExtTCPPureAcks or TcpExtTCPHPAcks increases by 1. We
> +will discuss the difference in a later section.
> +
> +Now we can easily explain the nstat outputs on the server side and client
> +side.
> +
> +When the server received the first syn, it replied with a syn+ack
> +and came into the SYN-RCVD state, so TcpPassiveOpens increased by 1.
> +The server received a syn, sent a syn+ack and received an ack, so
> +the server sent 1 packet and received 2 packets: TcpInSegs increased
> +by 2, TcpOutSegs increased by 1. The last ack of the 3-way handshake
> +is a pure ack without data, so TcpExtTCPPureAcks increased by 1.
> +
> +When the client sent the syn, the client came into the SYN-SENT
> +state, so TcpActiveOpens increased by 1. The client sent a syn,
> +received a syn+ack and sent an ack, so the client sent 2 packets and
> +received 1 packet: TcpInSegs increased by 1, TcpOutSegs increased by
> +2.
> +
> +Note: about TcpInSegs and TcpOutSegs, rfc1213 doesn't define their
> +behavior when gso/gro/tso are enabled on the NIC (network interface
> +card). In the current linux implementation, TcpOutSegs is aware of
> +gso/tso, but TcpInSegs is not aware of gro. So TcpOutSegs will count
> +the actual packet number even if only 1 packet is sent via the tcp
> +layer. If multiple packets arrive at a NIC and they are merged into
> +1 packet, TcpInSegs will only count 1.
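A hypothetical illustration of the note above: one 64 KB chunk leaving the tcp layer as a single GSO packet is counted once by the ip layer but as many segments by TcpOutSegs (the mss and payload values here are assumptions for the example):

```python
import math

mss = 1448            # typical mss for a 1500-byte mtu path
gso_payload = 65536   # one 64 KB chunk handed down by the tcp layer

ip_out_requests = 1                          # counted before segmentation
tcp_out_segs = math.ceil(gso_payload / mss)  # counted after gso

print(ip_out_requests, tcp_out_segs)  # 1 46
```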
> +
> +tcp disconnect
> +==============
> +
> +Continue our previous example, on the server side, we have run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we have run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +Now we type Ctrl-C on the client side to stop the tcp connection
> +between the two nc commands. Then we check the nstat output.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +On client side::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +Wait for more than 1 minute, run nstat on client again::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpExtTW                        1                  0.0
> +
> +Most of the counters were explained in the previous section, except
> +two: TcpExtTCPOrigDataSent and TcpExtTW. Both of them are linux
> +kernel extensions.
> +
> +TcpExtTW means a tcp connection was closed normally via the time
> +wait stage, not via the tcp reuse process.
> +
> +About TcpExtTCPOrigDataSent, the kernel patch below has a good
> +explanation:
> +
> +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
> +
> +I pasted it here::
> +
> +  TCPOrigDataSent: number of outgoing packets with original data
> +  (excluding retransmission but including data-in-SYN). This counter is
> +  different from TcpOutSegs because TcpOutSegs also tracks pure
> +  ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate.
> +
> +the effect of gso and gro
> +=========================
> +
> +Generic Segmentation Offload (GSO) and Generic Receive Offload (GRO)
> +affect the packet in/out counters on both the ip and tcp layers.
> +Here is an iperf example. Before the test, run the command below to
> +make sure both gso and gro are enabled on the NIC::
> +
> +  $ sudo ethtool -k ens3 | egrep '(generic-segmentation-offload|generic-receive-offload)'
> +  generic-segmentation-offload: on
> +  generic-receive-offload: on
> +
> +On server side, run::
> +
> +  iperf3 -s -p 9000
> +
> +On client side, run::
> +
> +  iperf3 -c 192.168.122.251 -p 9000 -t 5 -P 10
> +
> +The server listened on tcp port 9000; the client connected to the
> +server, created 10 parallel streams and ran for 5 seconds. After
> +iperf3 stopped, we ran nstat on both the server and the client.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    36346              0.0
> +  IpInDelivers                    36346              0.0
> +  IpOutRequests                   33836              0.0
> +  TcpPassiveOpens                 11                 0.0
> +  TcpEstabResets                  2                  0.0
> +  TcpInSegs                       36346              0.0
> +  TcpOutSegs                      33836              0.0
> +  TcpOutRsts                      20                 0.0
> +  TcpExtDelayedACKs               26                 0.0
> +  TcpExtTCPHPHits                 32120              0.0
> +  TcpExtTCPPureAcks               16                 0.0
> +  TcpExtTCPHPAcks                 5                  0.0
> +  TcpExtTCPAbortOnData            5                  0.0
> +  TcpExtTCPAbortOnClose           2                  0.0
> +  TcpExtTCPRcvCoalesce            7306               0.0
> +  TcpExtTCPOFOQueue               1354               0.0
> +  TcpExtTCPOrigDataSent           15                 0.0
> +  IpExtInOctets                   311732432          0.0
> +  IpExtOutOctets                  1785119            0.0
> +  IpExtInNoECTPkts                214032             0.0
> +
> +Client side::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    33836              0.0
> +  IpInDelivers                    33836              0.0
> +  IpOutRequests                   43786              0.0
> +  TcpActiveOpens                  11                 0.0
> +  TcpEstabResets                  10                 0.0
> +  TcpInSegs                       33836              0.0
> +  TcpOutSegs                      214072             0.0
> +  TcpRetransSegs                  3876               0.0
> +  TcpExtDelayedACKs               7                  0.0
> +  TcpExtTCPHPHits                 5                  0.0
> +  TcpExtTCPPureAcks               2719               0.0
> +  TcpExtTCPHPAcks                 31071              0.0
> +  TcpExtTCPSackRecovery           607                0.0
> +  TcpExtTCPSACKReorder            61                 0.0
> +  TcpExtTCPLostRetransmit         90                 0.0
> +  TcpExtTCPFastRetrans            3806               0.0
> +  TcpExtTCPSlowStartRetrans       62                 0.0
> +  TcpExtTCPLossProbes             38                 0.0
> +  TcpExtTCPSackRecoveryFail       8                  0.0
> +  TcpExtTCPSackShifted            203                0.0
> +  TcpExtTCPSackMerged             778                0.0
> +  TcpExtTCPSackShiftFallback      700                0.0
> +  TcpExtTCPSpuriousRtxHostQueues  4                  0.0
> +  TcpExtTCPAutoCorking            14                 0.0
> +  TcpExtTCPOrigDataSent           214038             0.0
> +  TcpExtTCPHystartTrainDetect     8                  0.0
> +  TcpExtTCPHystartTrainCwnd       172                0.0
> +  IpExtInOctets                   1785227            0.0
> +  IpExtOutOctets                  317789680          0.0
> +  IpExtInNoECTPkts                33836              0.0
> +
> +The TcpOutSegs and IpOutRequests on the server are 33836, exactly
> +the same as IpExtInNoECTPkts, IpInReceives, IpInDelivers and
> +TcpInSegs on the client side. During the iperf3 test, the server
> +only replied with very short packets, so gso and gro had no effect
> +on the server's replies.
> +
> +On the client side, TcpOutSegs is 214072 and IpOutRequests is 43786;
> +the tcp layer packet-out count is larger than the ip layer one,
> +because TcpOutSegs counts the packet number after gso, but
> +IpOutRequests doesn't. On the server side, IpExtInNoECTPkts is
> +214032, a little smaller than the TcpOutSegs on the client side
> +(214072), which might be caused by packet loss. The IpInReceives,
> +IpInDelivers and TcpInSegs on the server are obviously smaller than
> +the TcpOutSegs on the client side, because these counters count the
> +packets after gro.
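The relations between these numbers can be checked directly (values copied from the nstat outputs above):

```python
client_tcp_out_segs = 214072   # post-gso segments sent by the client
server_in_noect_pkts = 214032  # frames that actually reached the server NIC
server_tcp_in_segs = 36346     # post-gro segments delivered to tcp
client_ip_out_requests = 43786 # pre-gso ip-layer sends on the client

# A few frames were lost in transit:
print(client_tcp_out_segs - server_in_noect_pkts)   # 40
# gro merged roughly 6 wire frames into each segment seen by tcp:
print(round(server_in_noect_pkts / server_tcp_in_segs, 1))  # 5.9
```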
> +
> +tcp counters in established state
> +=================================
> +
> +Run nc on server::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run nc on client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +Input a string in the nc client ('hello' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +
> +The client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Input a string in the nc client side again ('world' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +  world
> +
> +Client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPAcks                 1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +
> +Server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPHits                 1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Comparing the first client side output and the second client side
> +output, we can find one difference: the first one has
> +TcpExtTCPPureAcks, but the second one has TcpExtTCPHPAcks. The first
> +server side output and the second server side output differ too: the
> +second server side output has TcpExtTCPHPHits, but the first server
> +side output doesn't. The network traffic patterns were exactly the
> +same: the client sent a packet to the server and the server replied
> +with an ack. But the kernel handled them in different ways. When the
> +kernel receives a tcp packet in the established state, it has two
> +paths to handle the packet: one is the fast path, the other is the
> +slow path. The comment in the kernel code provides a good
> +explanation of them; I paste it below::
> +
> +  It is split into a fast path and a slow path. The fast path is
> +  disabled when:
> +  - A zero window was announced from us - zero window probing
> +    is only handled properly on the slow path.
> +  - Out of order segments arrived.
> +  - Urgent data is expected.
> +  - There is no buffer space left
> +  - Unexpected TCP flags/window values/header lengths are received
> +    (detected by checking the TCP header against pred_flags)
> +  - Data is sent in both directions. The fast path only supports pure senders
> +    or pure receivers (this means either the sequence number or the ack
> +    value must stay constant)
> +  - Unexpected TCP option.
> +
> +The kernel will try to use the fast path unless any of the above
> +conditions is satisfied. If the packets are out of order, the kernel
> +will handle them in the slow path, which means the performance might
> +not be very good. The kernel would also go into the slow path if the
> +"Delayed ack" is used, because when using "Delayed ack", data is sent
> +in both directions. When the TCP window scale option is not used, the
> +kernel will try to enable the fast path immediately when the
> +connection enters the established state; but if the TCP window scale
> +option is used, the kernel will disable the fast path at first, and
> +try to enable it after the kernel receives packets. We can use the
> +'ss' command to verify whether the window scale option is used, e.g.
> +run the below command on either the server or the client::
> +
> +  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
> +  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
> +  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
> +             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
> +
> +The 'wscale:7,7' means both the server and the client set the window
> +scale option to 7. Now we can explain the nstat output in our test:
> +
> +In the first client side nstat output, the client sent a packet and
> +the server replied with an ack; when the kernel handled this ack, the
> +fast path was not enabled, so the ack was counted into
> +'TcpExtTCPPureAcks'.
> +In the second client side nstat output, the client sent a packet again
> +and received another ack from the server; this time the fast path was
> +enabled, and the ack qualified for the fast path, so it was handled by
> +the fast path and counted into 'TcpExtTCPHPAcks'.
> +In the first server side nstat output, the fast path was not enabled,
> +so there was no 'TcpExtTCPHPHits'.
> +In the second server side nstat output, the fast path was enabled, and
> +the packet received from the client qualified for the fast path, so it
> +was counted into 'TcpExtTCPHPHits'.
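> +
> +For readers who want to script these observations: the counters that
> +nstat reads come from /proc/net/snmp and /proc/net/netstat on Linux.
> +Below is a minimal illustrative sketch (an editorial addition, not
> +part of the experiments above) that parses that format and diffs two
> +snapshots, similar to what running 'nstat' after 'nstat -n' shows:

```python
# Minimal counter reader in the spirit of 'nstat' (illustrative sketch).
import os

def parse_counters(text):
    """Parse /proc/net/snmp or /proc/net/netstat style text into a dict.

    Each protocol contributes a header line ("TcpExt: Name1 Name2 ...")
    followed by a value line ("TcpExt: 1 2 ..."); the two are merged
    into keys like 'TcpExtName1', matching nstat's naming.
    """
    counters = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        proto, names = header.split(':', 1)
        _, nums = values.split(':', 1)
        for name, num in zip(names.split(), nums.split()):
            counters[proto + name] = int(num)
    return counters

def diff_counters(before, after):
    """Return only the counters whose values changed between snapshots."""
    return {k: after[k] - before.get(k, 0)
            for k in after if after[k] != before.get(k, 0)}

if __name__ == '__main__':
    # On a Linux host, print one counter from a live snapshot.
    if os.path.exists('/proc/net/netstat'):
        with open('/proc/net/netstat') as f:
            snapshot = parse_counters(f.read())
        print('TcpExtTCPHPAcks =', snapshot.get('TcpExtTCPHPAcks', 0))
```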
> +
> +TCP abort
> +=========
> +
> +Some counters indicate the reasons why the TCP layer wants to send an
> +RST; they are:
> +
> +* TcpExtTCPAbortOnData
> +* TcpExtTCPAbortOnClose
> +* TcpExtTCPAbortOnMemory
> +* TcpExtTCPAbortOnTimeout
> +* TcpExtTCPAbortOnLinger
> +* TcpExtTCPAbortFailed
> +
> +TcpExtTCPAbortOnData
> +--------------------
> +
> +It means the TCP layer has data in flight, but needs to close the
> +connection. So the TCP layer sends an RST to the other side,
> +indicating the connection was not closed very gracefully. An easy way
> +to increase this counter is using the SO_LINGER option. Please refer
> +to the SO_LINGER section of the socket man page:
> +
> +http://man7.org/linux/man-pages/man7/socket.7.html
> +
> +By default, when an application closes a connection, the close
> +function will return immediately and the kernel will try to send the
> +in-flight data asynchronously. If you use the SO_LINGER option, set
> +l_onoff to 1, and set l_linger to a positive number, the close
> +function won't return immediately, but will wait for the in-flight
> +data to be acked by the other side; the max wait time is l_linger
> +seconds. If you set l_onoff to 1 and set l_linger to 0, when the
> +application closes a connection, the kernel will send an RST
> +immediately and increase the TcpExtTCPAbortOnData counter by 1.
> +
> +We run nc on the server side::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run the below Python code on the client side::
> +
> +  import socket
> +  import struct
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
> +  s.connect((server, port))
> +  s.close()
> +
> +On the client side, we can see that TcpExtTCPAbortOnData increased::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnData            1                  0.0
> +
> +If we capture packets with tcpdump, we can see that the client sent an
> +RST instead of a FIN.
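> +
> +The same RST-on-close behavior can also be observed on a single host
> +over the loopback interface. Below is an illustrative Python sketch
> +(an editorial addition, not part of the original two-machine setup);
> +it assumes a Linux host, where a close() with SO_LINGER l_onoff=1 and
> +l_linger=0 aborts the connection:

```python
# Closing a socket with SO_LINGER(l_onoff=1, l_linger=0) aborts the
# connection with an RST: the peer's recv() fails with ECONNRESET
# instead of seeing a clean end-of-stream (FIN).
import socket
import struct

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))          # pick a free ephemeral port
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
peer, _ = server.accept()

# l_onoff=1, l_linger=0: close() sends an RST immediately
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                  struct.pack('ii', 1, 0))
client.close()

peer.settimeout(5)                     # avoid hanging if the RST is lost
try:
    peer.recv(1024)
    result = 'clean close (FIN)'
except ConnectionResetError:
    result = 'connection reset (RST)'
print(result)
server.close()
peer.close()
```

After running it, 'nstat | grep -i abort' on the same host should show
TcpExtTCPAbortOnData increased.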
> +
> +
> +TcpExtTCPAbortOnClose
> +---------------------
> +
> +This counter means the TCP layer has unread data when an application
> +wants to close a connection.
> +
> +On the server side, we run the below Python script::
> +
> +  import socket
> +  import time
> +
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +This Python script listens on port 9000, but doesn't read anything
> +from the connection.
> +
> +On the client side, we send the string "hello" with nc::
> +
> +  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
> +
> +Then we come back to the server side. The server has received the
> +"hello" packet, and the TCP layer has acked this packet, but the
> +application hasn't read it yet. We type Ctrl-C to terminate the server
> +script. Then we can find that TcpExtTCPAbortOnClose increased by 1 on
> +the server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnClose           1                  0.0
> +
> +If we run tcpdump on the server side, we can find that the server sent
> +an RST after we typed Ctrl-C.
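> +
> +This behavior, too, can be reproduced on a single Linux host over
> +loopback. Below is an illustrative sketch (an editorial addition, not
> +part of the original experiment): the receiver closes its socket while
> +unread data sits in its receive queue, so the kernel sends an RST
> +instead of a normal FIN:

```python
# Closing a socket that still has unread received data makes the Linux
# kernel send an RST (and bump TcpExtTCPAbortOnClose) rather than a FIN.
import socket
import time

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
peer, _ = server.accept()

client.sendall(b'hello')
time.sleep(0.2)        # let the data reach peer's receive queue
peer.close()           # unread data -> kernel sends RST

time.sleep(0.2)        # let the RST reach the client
client.settimeout(5)
try:
    client.recv(1024)
    result = 'clean close (FIN)'
except ConnectionResetError:
    result = 'connection reset (RST)'
print(result)
server.close()
client.close()
```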
> +
> +TcpExtTCPAbortOnMemory
> +----------------------
> +
> +When an application closes a TCP connection, the kernel still needs
> +to track the connection to let it complete the TCP disconnect
> +process. E.g. when an app calls the close method of a socket, the
> +kernel sends a FIN to the other side of the connection; after that,
> +the app has no relationship with the socket any more, but the kernel
> +needs to keep the socket. This socket becomes an orphan socket: the
> +kernel waits for the reply of the other side, and would finally come
> +to the TIME_WAIT state. When the kernel doesn't have enough memory to
> +keep the orphan socket, it sends an RST to the other side, deletes the
> +socket, and increases the TcpExtTCPAbortOnMemory counter by 1. Two
> +conditions would trigger TcpExtTCPAbortOnMemory:
> +
> +* the memory used by the TCP protocol is higher than the third value
> +  of tcp_mem. Please refer to the tcp_mem section in the TCP man page:
> +
> +  http://man7.org/linux/man-pages/man7/tcp.7.html
> +
> +* the orphan socket count is higher than net.ipv4.tcp_max_orphans
> +
> +Below is an example which lets the orphan socket count become higher
> +than net.ipv4.tcp_max_orphans.
> +
> +Change tcp_max_orphans to a smaller value on client::
> +
> +  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
> +
> +Client code (creates 64 connections to the server)::
> +
> +  nstatuser@nstat-a:~$ cat client_orphan.py
> +  import socket
> +  import time
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  count = 64
> +
> +  connection_list = []
> +
> +  for i in range(count):
> +      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +      s.connect((server, port))
> +      connection_list.append(s)
> +      print("connection_count: %d" % len(connection_list))
> +
> +  while True:
> +      time.sleep(99999)
> +
> +Server code (accepts 64 connections from the client)::
> +
> +  nstatuser@nstat-b:~$ cat server_orphan.py
> +  import socket
> +  import time
> +
> +  port = 9000
> +  count = 64
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(count)
> +  connection_list = []
> +  while True:
> +      sock, addr = s.accept()
> +      connection_list.append((sock, addr))
> +      print("connection_count: %d" % len(connection_list))
> +
> +Run the Python scripts on the server and the client.
> +
> +On server::
> +
> +  python3 server_orphan.py
> +
> +On client::
> +
> +  python3 client_orphan.py
> +
> +Run iptables on server::
> +
> +  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
> +
> +Type Ctrl-C on client, stop client_orphan.py.
> +
> +Check TcpExtTCPAbortOnMemory on client::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnMemory          54                 0.0
> +
> +Check the orphan socket count on the client::
> +
> +  nstatuser@nstat-a:~$ ss -s
> +  Total: 131 (kernel 0)
> +  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
> +
> +  Transport Total     IP        IPv6
> +  *         0         -         -
> +  RAW       1         0         1
> +  UDP       1         1         0
> +  TCP       14        13        1
> +  INET      16        14        2
> +  FRAG      0         0         0
> +
> +The explanation of the test: after running server_orphan.py and
> +client_orphan.py, we set up 64 connections between the server and the
> +client. After running the iptables command, the server drops all
> +packets from the client. When we type Ctrl-C on client_orphan.py, the
> +client system tries to close these connections, and before they are
> +closed gracefully, these connections become orphan sockets. As the
> +iptables rule on the server blocks packets from the client, the server
> +won't receive a FIN from the client, so all connections on the client
> +are stuck in the FIN_WAIT_1 state, and they stay orphan sockets until
> +they time out. We have echoed 10 to
> +/proc/sys/net/ipv4/tcp_max_orphans, so the client system only keeps 10
> +orphan sockets; for all the other orphan sockets, the client system
> +sent an RST and deleted them. We had 64 connections, so the 'ss -s'
> +command shows the system has 10 orphan sockets, and the value of
> +TcpExtTCPAbortOnMemory was 54.
> +
> +An additional explanation about the orphan socket count: you can find
> +the exact orphan socket count by the 'ss -s' command, but when the
> +kernel decides whether to increase TcpExtTCPAbortOnMemory and send an
> +RST, it doesn't always check the exact orphan socket count. To
> +increase performance, the kernel checks an approximate count first; if
> +the approximate count is more than tcp_max_orphans, the kernel checks
> +the exact count again. So if the approximate count is less than
> +tcp_max_orphans, but the exact count is more than tcp_max_orphans, you
> +would find that TcpExtTCPAbortOnMemory is not increased at all. If
> +tcp_max_orphans is large enough, this won't occur, but if you decrease
> +tcp_max_orphans to a small value like in our test, you might hit this
> +issue. That is why, in our test, the client set up 64 connections
> +although tcp_max_orphans is 10: if the client only set up 11
> +connections, we couldn't see TcpExtTCPAbortOnMemory change.
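> +
> +As an illustration of why an approximate count can lag the exact one,
> +below is a toy model of a per-CPU batched counter (an editorial
> +addition in the spirit of the kernel's percpu_counter, not the
> +kernel's actual code; the batch size and CPU count are made up):

```python
# Toy model of a per-CPU batched counter: each CPU keeps a local delta
# and flushes it to a cheap-to-read shared count only when a batch
# fills.  The approximate read skips the cross-CPU summing, so it can
# undercount, and a cheap check against tcp_max_orphans may pass while
# the exact check would fail.
BATCH = 32

class BatchedCounter:
    def __init__(self, ncpus):
        self.shared = 0             # cheap-to-read global count
        self.local = [0] * ncpus    # per-CPU pending deltas

    def inc(self, cpu):
        self.local[cpu] += 1
        if self.local[cpu] >= BATCH:      # flush only when the batch fills
            self.shared += self.local[cpu]
            self.local[cpu] = 0

    def approx(self):
        return self.shared                # fast path: shared count only

    def exact(self):
        return self.shared + sum(self.local)

orphans = BatchedCounter(ncpus=4)
for i in range(60):                  # 60 orphan sockets created
    orphans.inc(cpu=i % 4)           # spread over 4 CPUs

# Each CPU holds 15 pending increments (< BATCH), so nothing was
# flushed yet: the approximate count is 0 while the exact count is 60.
print(orphans.approx(), orphans.exact())   # 0 60
```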
> +
> +TcpExtTCPAbortOnTimeout
> +-----------------------
> +
> +This counter will increase when any of the TCP timers expires. In this
> +situation, the kernel won't send an RST; it just gives up the
> +connection.
> +Continuing the previous test: because the iptables rule on the server
> +blocked the traffic, the server wouldn't receive a FIN, and all of the
> +client's orphan sockets would finally time out in the FIN_WAIT_1
> +state. So after waiting for a few minutes, we can find 10 timeouts on
> +the client::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnTimeout         10                 0.0
> +
> +TcpExtTCPAbortOnLinger
> +----------------------
> +
> +When a TCP connection comes into the FIN_WAIT_2 state, instead of
> +waiting for the FIN packet from the other side, the kernel could send
> +an RST and delete the socket immediately. This is not the default
> +behavior of the Linux kernel TCP stack; but by configuring socket
> +options, you could let the kernel follow this behavior. Below is an
> +example.
> +
> +The server side code::
> +
> +  nstatuser@nstat-b:~$ cat server_linger.py
> +  import socket
> +  import time
> +
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +The client side code::
> +
> +  nstatuser@nstat-a:~$ cat client_linger.py
> +  import socket
> +  import struct
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
> +  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
> +  s.connect((server, port))
> +  s.close()
> +
> +Run server_linger.py on server::
> +
> +  nstatuser@nstat-b:~$ python3 server_linger.py
> +
> +Run client_linger.py on client::
> +
> +  nstatuser@nstat-a:~$ python3 client_linger.py
> +
> +After running client_linger.py, check the output of nstat::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnLinger          1                  0.0
> +
> +TcpExtTCPAbortFailed
> +--------------------
> +
> +The kernel TCP layer will send an RST if the conditions of RFC 2525
> +section 2.17 are satisfied:
> +
> +https://tools.ietf.org/html/rfc2525#page-50
> +
> +If an internal error occurs during this process, TcpExtTCPAbortFailed
> +will be increased.
> +
> +TcpExtListenOverflows and TcpExtListenDrops
> +===========================================
> +
> +When the kernel receives a SYN from a client and the TCP accept queue
> +is full, the kernel will drop the SYN and add 1 to
> +TcpExtListenOverflows. At the same time, the kernel will also add 1 to
> +TcpExtListenDrops. When a TCP socket is in the LISTEN state and the
> +kernel needs to drop a packet, it always adds 1 to
> +TcpExtListenDrops. So an increase of TcpExtListenOverflows lets
> +TcpExtListenDrops increase at the same time, but TcpExtListenDrops can
> +also increase without TcpExtListenOverflows increasing, e.g. a memory
> +allocation failure would also let TcpExtListenDrops increase.
> +
> +Note: the above explanation is based on kernel 4.15 or later; an
> +older kernel's TCP stack behaves differently when the TCP accept
> +queue is full. On an old kernel, the TCP stack won't drop the SYN; it
> +completes the 3-way handshake, but as the accept queue is full, the
> +TCP stack keeps the socket in the TCP half-open queue. While the
> +socket is in the half-open queue, the TCP stack sends SYN+ACK on an
> +exponential backoff timer; after the client replies with an ACK, the
> +TCP stack checks whether the accept queue is still full. If it is not
> +full, the socket is moved to the accept queue; if it is full, the
> +socket stays in the half-open queue, and the next time the client
> +replies with an ACK, this socket gets another chance to move to the
> +accept queue.
> +
> +Here is an example:
> +
> +On the server, run the nc command, listening on port 9000::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client, run 3 nc commands in different terminals::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +The nc command only accepts 1 connection, and the accept queue length
> +is 1. In the current Linux implementation, setting the queue length to
> +n means the actual queue length is n+1. Now we create 3 connections: 1
> +is accepted by nc, and 2 are in the accept queue, so the accept queue
> +is full.
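> +
> +The n+1 behavior can be checked quickly on loopback; below is an
> +illustrative Python sketch (an editorial addition, independent of the
> +nc setup above), assuming a Linux host:

```python
# With listen(1), Linux lets 2 completed connections sit in the accept
# queue before further SYNs are dropped: backlog n gives an effective
# accept queue of n+1.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)                      # backlog 1 -> effective queue 2
port = server.getsockname()[1]

clients = []
for i in range(2):                    # both handshakes complete even
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.settimeout(2)                   # though accept() is never called
    c.connect(('127.0.0.1', port))    # succeeds: queued in accept queue
    clients.append(c)

print('queued connections:', len(clients))
for c in clients:
    c.close()
server.close()
```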
> +
> +Before running the 4th nc, we clean the nstat history on the server::
> +
> +  nstatuser@nstat-b:~$ nstat -n
> +
> +Run the 4th nc on the client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +
> +If the nc server is running on kernel 4.15 or a higher version, you
> +won't see the "Connection to ... succeeded!" string, because the
> +kernel will drop the SYN if the accept queue is full. If the nc server
> +is running on an old kernel, you will see that the connection
> +succeeded, because the kernel would complete the 3-way handshake and
> +keep the socket in the half-open queue.
> +
> +Our test is on kernel 4.15. Run nstat on the server::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    4                  0.0
> +  IpInDelivers                    4                  0.0
> +  TcpInSegs                       4                  0.0
> +  TcpExtListenOverflows           4                  0.0
> +  TcpExtListenDrops               4                  0.0
> +  IpExtInOctets                   240                0.0
> +  IpExtInNoECTPkts                4                  0.0
> +
> +We can see both TcpExtListenOverflows and TcpExtListenDrops are 4. If
> +the time between the 4th nc and the nstat run is longer, the values of
> +TcpExtListenOverflows and TcpExtListenDrops will be larger, because
> +the SYN of the 4th nc was dropped and the client keeps retrying.
> +
> --
> 2.17.1
>


* Re: [PATCH] add an initial version of snmp_counter.rst
  2018-11-09 18:13 [PATCH] add an initial version of snmp_counter.rst yupeng
  2018-11-10  2:20 ` Cong Wang
@ 2018-11-10 16:56 ` Randy Dunlap
  1 sibling, 0 replies; 3+ messages in thread
From: Randy Dunlap @ 2018-11-10 16:56 UTC (permalink / raw)
  To: yupeng, netdev, xiyou.wangcong

On 11/9/18 10:13 AM, yupeng wrote:
> The snmp_counter.rst run a set of simple experiments, explains the
> meaning of snmp counters depend on the experiments' results. This is
> an initial version, only covers a small part of the snmp counters.
> 
> Signed-off-by: yupeng <yupeng0921@gmail.com>
> ---
>  Documentation/networking/index.rst        |   1 +
>  Documentation/networking/snmp_counter.rst | 963 ++++++++++++++++++++++
>  2 files changed, 964 insertions(+)
>  create mode 100644 Documentation/networking/snmp_counter.rst

Hi,

> diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
> new file mode 100644
> index 000000000000..2939c5acf675
> --- /dev/null
> +++ b/Documentation/networking/snmp_counter.rst
> @@ -0,0 +1,963 @@
> +====================
> +snmp counter tutorial
> +====================

First, I would change (almost) all occurrences of "snmp" to "SNMP", unless
it refers to a variable or struct or function etc. (something in C language)
or to the name of this file (document).

> +
> +This document explains the meaning of snmp counters. For understanding
> +their meanings better, this document doesn't explain the counters one
> +by one, but creates a set of experiments, and explains the counters
> +depend on the experiments' results. The experiments are on one or two

   depending on

> +virtual machines. Except for the test commands we use in the experiments,
> +the virtual machines have no other network traffic. We use the 'nstat'
> +command to get the values of snmp counters, before every test, we run

                                     counters;

> +'nstat -n' to update the history, so the 'nstat' output would only
> +show the changes of the snmp counters. For more information about
> +nstat, please refer:
> +
> +http://man7.org/linux/man-pages/man8/nstat.8.html
> +
> +icmp ping
> +========

s/icmp/ICMP/

> +
> +Run the ping command against the public dns server 8.8.8.8::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
> +  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
> +    
> +  --- 8.8.8.8 ping statistics ---
> +  1 packets transmitted, 1 received, 0% packet loss, time 0ms
> +  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
> +
> +The nstayt result::

       nstat

> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpOutMsgs                     1                  0.0
> +  IcmpOutEchos                    1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IcmpMsgOutType8                 1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtOutOctets                  84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The nstat output could be divided into two part: one with the 'Ext'
> +keyword, another without the 'Ext' keyword. If the counter name
> +doesn't have 'Ext', it is defined by one of snmp rfc, if it has 'Ext',

                                     by one of the SNMP RFCs; if it has

> +it is a kernel extent counter. Below we explain them one by one.
> +
> +The rfc defined counters
> +----------------------
s/rfc/RFC/

> +
> +* IpInReceives
> +The total number of input datagrams received from interfaces,
> +including those received in error.
> +
> +https://tools.ietf.org/html/rfc1213#page-26
> +
> +* IpInDelivers
> +The total number of input datagrams successfully delivered to IP
> +user-protocols (including ICMP).
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IpOutRequests
> +The total number of IP datagrams which local IP user-protocols
> +(including ICMP) supplied to IP in requests for transmission.  Note
> +that this counter does not include any datagrams counted in
> +ipForwDatagrams.
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IcmpInMsgs
> +The total number of ICMP messages which the entity received.  Note
> +that this counter includes all those counted by icmpInErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-41
> +
> +* IcmpInEchoReps
> +The number of ICMP Echo Reply messages received.
> +
> +https://tools.ietf.org/html/rfc1213#page-42
> +
> +* IcmpOutMsgs
> +The total number of ICMP messages which this entity attempted to send.
> +Note that this counter includes all those counted by icmpOutErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-43
> +
> +* IcmpOutEchos
> +The number of ICMP Echo (request) messages sent.
> +
> +https://tools.ietf.org/html/rfc1213#page-45
> +
> +IcmpMsgInType0 and IcmpMsgOutType8 are not defined by any snmp related

                                                             SNMP-related

> +RFCs, but their meaning are quite straightforward, they count the

                   meanings          straightforward: they count the

> +packet number of specific icmp packet types. We could find the icmp

s/could/can/
s/icmp/ICMP/ 2 times

> +types here:
> +
> +https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
> +
> +Type 8 is echo, type 0 is echo reply.
> +
> +Until now, we can easily explain these items of the nstat: We sent an
> +icmp echo request, so IpOutRequests, IcmpOutMsgs, IcmpOutEchos and

   ICMP

> +IcmpMsgOutType8 were increased 1. We got icmp echo reply from 8.8.8.8,

                                            ICMP

> +so IpInReceives, IcmpInMsgs, IcmpInEchoReps, IcmpMsgInType0 were
> +increased 1. The icmp echo reply was passed to icmp layer via ip

s/icmp/ICMP/ 2 times
s/ip/IP/

> +layer, so IpInDelivers was increased 1.
> +
> +Please note, these metrics don't aware LRO/GRO, e.g., IpOutRequests

                              aren't aware of

> +might count 1 packet, but hardware splits it to 2, and sends them
> +separately.
> +
> +IpExtInOctets and IpExtOutOctets
> +------------------------------
> +They are linux kernel extensions, no rfc definitions. Please note,

s/linux/Linux/
s/no rfc/not RFC/

> +rfc1213 indeed defines ifInOctets  and ifOutOctets, but they
> +are different things. The ifInOctets and ifOutOctets are packets

s/packets/packet/

> +size which includes the mac layer. But IpExtInOctets and IpExtOutOctets

s/size/sizes/
s/mac/MAC/

> +are only ip layer sizes.

s/ip/IP/

> +
> +In our example, an ICMP echo request has four parts:
> +* 14 bytes mac header

              MAC

> +* 20 bytes ip header

              IP

> +* 16 bytes icmp header

              ICMP

> +* 48 bytes data (default value of the ping command)
> +
> +So IpExtInOctets value is 20+16+48=84. The IpExtOutOctets is similar.

                                                             value is similar.

> +
> +IpExtInNoECTPkts
> +---------------
> +We could find IpExtInNoECTPkts in the nstat output, but kernel provide

   We can find                                                    provides

> +four similar counters, we explain them together, they are:

                counters.  We explain them together; they are:

> +* IpExtInNoECTPkts
> +* IpExtInECT1Pkts
> +* IpExtInECT0Pkts
> +* IpExtInCEPkts
> +
> +They indicate four kinds of ECN IP packets, they are defined here:

                                      packets;

> +
> +https://tools.ietf.org/html/rfc3168#page-6
> +
> +These 4 counters calculate how many packets received per ECN
> +status. They count the real frame number regardless the LRO/GRO. So

                                     numbers regardless of LRO/GRO. So

> +for the same packet, you might find that IpInReceives count 1, but

                                       that the IpInReceives count is 1, but

> +IpExtInNoECTPkts counts 2 or more.

   the IpExtInNoECTPkts count is 2 or more.

> +
> +additional explain
> +-----------------

additional explanation
----------------------
  (note: the "---" line must be the same length as the "heading" line above it.)

> +The ip layer counters are recorded by the ip layer code in the kernel. I mean, if you send a packet to a lower layer directly, Linux

s/ip/IP/ 2 times
Also split that line so that it is < 80 columns wide.
Then:
                                                                        This means that
if you send a packet to a lower layer directly, Linux

> +kernel won't record it. For example, tcpreplay will open an
> +AF_PACKET socket, and send the packet to layer 2, although it could send

                                            layer 2, and although it could send

> +an IP packet, you can't find it from the nstat output. Here is an
> +example:
> +
> +We capture the ping packet by tcpdump::
> +
> +  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/ping.pcap dst 8.8.8.8
> +
> +Then run ping command::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +
> +Terminate tcpdump by Ctrl-C, and run 'nstat -n' to update the nstat
> +history. Then run tcpreplay::
> +
> +  nstatuser@nstat-a:~$ sudo tcpreplay --intf1=ens3 /tmp/ping.pcap
> +  Actual: 1 packets (98 bytes) sent in 0.000278 seconds
> +  Rated: 352517.9 Bps, 2.82 Mbps, 3597.12 pps
> +  Flows: 1 flows, 3597.12 fps, 1 flow packets, 0 non-flow
> +  Statistics for network device: ens3
> +          Successful packets:        1
> +          Failed packets:            0
> +          Truncated packets:         0
> +          Retried packets (ENOBUFS): 0
> +          Retried packets (EAGAIN):  0
> +
> +Check the nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +We can see, nstat only show the received packet, because the IP layer

We can see that nstat only shows the received packet, because the IP layer

> +of kernel only know the reply of 8.8.8.8, it doesn't know what

of the kernel only knows about the reply of 8.8.8.8; it doesn't know what

> +tcpreplay sent.
> +
> +At the same time, when you use AF_INET socket, even you use the

                                                  even if you use the

> +SOCK_RAW option, the IP layer will still try to verify whether the
> +packet is an ICMP packet, if it is, kernel will still count it to its

   packet is an ICMP packet. If it is, the kernel will

> +counters and you can find it in the output of nstat.

^^^ interesting.

> +
> +tcp 3 way handshake
> +==================

TCP 3-way handshake
===================

> +
> +On server side, we run::

   On the server side,

> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we run::

   On the client side,

> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +The server listened on tcp 9000 port, the client connected to it, they
> +completed the 3-way handshake.
> +
> +On server side, we can find below nstat output::

   On the server side, we can see the following nstat output::

> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpPassiveOpens                 1                  0.0
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +
> +On client side, we can find below nstat output::

   On the client side, we can see the following nstat output::

> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpActiveOpens                  1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      2                  0.0
> +
> +Except for TcpExtTCPPureAcks, all other counters are defined by rfc1213

                                                                by rfc1213.
and that's just part of a file name; I would really prefer to see:
                                                                by RFC 1213.

> +
> +* TcpActiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-SENT state from the CLOSED state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpPassiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-RCVD state from the LISTEN state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpInSegs
> +The total number of segments received, including those received in
> +error.  This count includes segments received on currently established
> +connections.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +* TcpOutSegs
> +The total number of segments sent, including those on current
> +connections but excluding those containing only retransmitted octets.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +
> +The TcpExtTCPPureAcks is an extension in linux kernel. When kernel

                                         in the Linux kernel. When the kernel

> +receives a TCP packet which set ACK flag and with no data, either

                               sets the ACK flag and with no data, either

> +TcpExtTCPPureAcks or TcpExtTCPHPAcks will increase 1. We will discuss
> +it in a later section.

s/it/this/

> +
> +Now we can easily explain the nstat outputs on the server side and client
> +side.
> +
> +When the server received the first syn, it replied a syn+ack, and came into

                                           it replied with a syn+ack,

> +SYN-RCVD state, so TcpPassiveOpens increased 1. The server received
> +syn, sent syn+ack, received ack, so server sent 1 packet, received 2
> +packets, TcpInSegs increased 2, TcpOutSegs increased 1. The last ack
> +of the 3-way handshake is a pure ack without data, so
> +TcpExtTCPPureAcks increased 1.
> +
> +When the client sent syn, the client came into the SYN-SENT state, so
> +TcpActiveOpens increased 1, client sent syn, received syn+ack, sent
> +ack, so client sent 2 packets, received 1 packet, TcpInSegs increased
> +1, TcpOutSegs increased 2.
> +
> +Note: about TcpInSegs and TcpOutSegs, rfc1213 doesn't define the
> +behaviors when gso/gro/tso are enabled on the NIC (network interface

                  GSO/GRO/TSO

> +card). On current linux implementation, TcpOutSegs awares gso/tso, but

                     Linux                            is aware of GSO/TSO, but

> +TcpInSegs doesn't aware gro. So TcpOutSegs will count the actual

             isn't aware of GRO.

> +packet number even only 1 packet is sent via tcp layer. If multiple

                 even if only 1 packet is sent via the TCP layer.

> +packets arrived at a NIC, and they are merged into 1 packet, TcpInSegs
> +will only count 1.
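For reference, nstat reads these counters from /proc/net/snmp and /proc/net/netstat. A minimal sketch of parsing the Tcp counter lines (the sample below is abbreviated; the real file has many more fields):

```python
# Minimal sketch: parse the Tcp name/value line pair from
# /proc/net/snmp-style text. The sample is abbreviated.
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax InSegs OutSegs RetransSegs
Tcp: 1 200 120000 36346 33836 0
"""

def parse_snmp_tcp(text):
    # The file holds a header line with counter names followed by a
    # line with the corresponding values.
    names_line, values_line = [l for l in text.splitlines()
                               if l.startswith('Tcp:')]
    names = names_line.split()[1:]
    values = [int(v) for v in values_line.split()[1:]]
    return dict(zip(names, values))

counters = parse_snmp_tcp(SAMPLE)
print(counters['InSegs'], counters['OutSegs'])  # 36346 33836
```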
> +
> +tcp disconnect
> +=============

TCP disconnect
==============

> +
> +Continue our previous example, on the server side, we have run::

   Continuing

> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we have run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +Now we type Ctrl-C on the client side, stop the tcp connection between the

                                          stopping the TCP

> +two nc command. Then we check the nstat output.

          commands.

> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +On client side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +Wait for more than 1 minute, run nstat on client again::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpExtTW                        1                  0.0
> +
> +Most of the counters are explained in the previous section except
> +two: TcpExtTCPOrigDataSent and TcpExtTW. Both of them are linux kernel

s/linux/Linux/

> +extensions.
> +
> +TcpExtTW means a tcp connection is closed normally via

s/tcp/TCP/

> +time wait stage, not via tcp reuse process.

s/tcp/TCP/

> +
> +About TcpExtTCPOrigDataSent, Below kernel patch has a good explanation:

                                the below kernel patch has

> +
> +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
> +
> +I pasted it here::
> +
> +  TCPOrigDataSent: number of outgoing packets with original data
> +  (excluding retransmission but including data-in-SYN). This counter is
> +  different from TcpOutSegs because TcpOutSegs also tracks pure
> +  ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate.
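As the quoted patch description suggests, TcpExtTCPOrigDataSent is a better denominator for the retransmission rate than TcpOutSegs. A rough calculation (the counter values are taken from the client-side iperf output shown in the next section):

```python
# Rough sketch: TCP retransmission rate from two nstat counters.
retrans_segs = 3876      # TcpRetransSegs
orig_data_sent = 214038  # TcpExtTCPOrigDataSent
rate = retrans_segs / orig_data_sent
print("retransmission rate: %.2f%%" % (rate * 100))  # 1.81%
```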
> +
> +the effect of gso and gro
> +=======================

the effect of GSO and GRO
=========================

> +
> +The Generic Segmentation Offload (GSO) and Generic Receive Offload

                                                              Offload (GRO)

> +would affect the metrics of the packet in/out on both ip and tcp

                                                 on both IP and TCP

> +layer. Here is an iperf example. Before the test, run below command to

   layers.                                           run the command below to


> +make sure both gso and gro are enabled on the NIC::

                  GSO and GRO

> +
> +  $ sudo ethtool -k ens3 | egrep '(generic-segmentation-offload|generic-receive-offload)'
> +  generic-segmentation-offload: on
> +  generic-receive-offload: on
> +
> +On server side, run::

   On the server side,

> +
> +  iperf3 -s -p 9000
> +
> +On client side, run::

   On the client side,

> +
> +  iperf3 -c 192.168.122.251 -p 9000 -t 5 -P 10
> +
> +The server listened on tcp port 9000, the client connected to the server,

                          TCP
> +created 10 parallel threads, and ran for 5 seconds. After the pierf3 stopped, we

                                               After iperf3 stopped, we

> +run nstat on both the server and client.
> +
> +On server side::

   On the server side::

> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    36346              0.0
> +  IpInDelivers                    36346              0.0
> +  IpOutRequests                   33836              0.0
> +  TcpPassiveOpens                 11                 0.0
> +  TcpEstabResets                  2                  0.0
> +  TcpInSegs                       36346              0.0
> +  TcpOutSegs                      33836              0.0
> +  TcpOutRsts                      20                 0.0
> +  TcpExtDelayedACKs               26                 0.0
> +  TcpExtTCPHPHits                 32120              0.0
> +  TcpExtTCPPureAcks               16                 0.0
> +  TcpExtTCPHPAcks                 5                  0.0
> +  TcpExtTCPAbortOnData            5                  0.0
> +  TcpExtTCPAbortOnClose           2                  0.0
> +  TcpExtTCPRcvCoalesce            7306               0.0
> +  TcpExtTCPOFOQueue               1354               0.0
> +  TcpExtTCPOrigDataSent           15                 0.0
> +  IpExtInOctets                   311732432          0.0
> +  IpExtOutOctets                  1785119            0.0
> +  IpExtInNoECTPkts                214032             0.0
> +
> +Client side::

   On the client side::

> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    33836              0.0
> +  IpInDelivers                    33836              0.0
> +  IpOutRequests                   43786              0.0
> +  TcpActiveOpens                  11                 0.0
> +  TcpEstabResets                  10                 0.0
> +  TcpInSegs                       33836              0.0
> +  TcpOutSegs                      214072             0.0
> +  TcpRetransSegs                  3876               0.0
> +  TcpExtDelayedACKs               7                  0.0
> +  TcpExtTCPHPHits                 5                  0.0
> +  TcpExtTCPPureAcks               2719               0.0
> +  TcpExtTCPHPAcks                 31071              0.0
> +  TcpExtTCPSackRecovery           607                0.0
> +  TcpExtTCPSACKReorder            61                 0.0
> +  TcpExtTCPLostRetransmit         90                 0.0
> +  TcpExtTCPFastRetrans            3806               0.0
> +  TcpExtTCPSlowStartRetrans       62                 0.0
> +  TcpExtTCPLossProbes             38                 0.0
> +  TcpExtTCPSackRecoveryFail       8                  0.0
> +  TcpExtTCPSackShifted            203                0.0
> +  TcpExtTCPSackMerged             778                0.0
> +  TcpExtTCPSackShiftFallback      700                0.0
> +  TcpExtTCPSpuriousRtxHostQueues  4                  0.0
> +  TcpExtTCPAutoCorking            14                 0.0
> +  TcpExtTCPOrigDataSent           214038             0.0
> +  TcpExtTCPHystartTrainDetect     8                  0.0
> +  TcpExtTCPHystartTrainCwnd       172                0.0
> +  IpExtInOctets                   1785227            0.0
> +  IpExtOutOctets                  317789680          0.0
> +  IpExtInNoECTPkts                33836              0.0
> +
> +The TcpOutSegs and IpOutRequests on the server are 33836, exactly the
> +same as IpExtInNoECTPkts, IpInReceives, IpInDelivers and TcpInSegs on
> +the client side. During iperf3 test, the server only reply very short

                                                        replies with very short

> +packets, so gso and gro has no effect on the server's reply.

            so GSO and GRO have no effect

> +
> +On the client side, TcpOutSegs is 214072, IpOutRequests is 43786, the
> +tcp layer packet out is larger than ip layer packet out, because

   TCP layer packet size out is larger than the IP layer packet size out, because

> +TcpOutSegs count the packet number after gso, but IpOutRequests

              counts                        GSO,

> +doesn't. On the server side, IpExtInNoECTPkts is 214032, this number

                                                 is 214032; this number

> +is a little smaller than the TcpOutSegs on the client side (214072), it

                                                                      ; it

> +might cause by the packet loss. The IpInReceives, IpInDelivers and

   might be caused by packet loss.

> +TcpInSegs are obviously smaller than the TcpOutSegs on the client side,
> +because these counters calculate the packet after gro.

                                        packet counts after GRO.

> +
> +tcp counters in established state
> +================================

TCP counters in established state
=================================

> +
> +Run nc on server::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run nc on client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +Input a string in the nc client ('hello' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +
> +The client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Input a string in the nc client again ('world' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +  world
> +
> +Client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPAcks                 1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +
> +Server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPHits                 1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Compare the first client side output and the second client side

   Comparing

> +output, we could find one difference: the first one had a
> +'TcpExtTCPPureAcks', but the second one had a
> +'TcpExtTCPHPAcks'. The first server side output and the second server
> +side output had a difference too: the second server side output had a
> +TcpExtTCPHPHits, but the first server side output didn't have it. The

Next line is too long -- split it into < 80 columns.

> +network traffic patterns were exactly the same: the client sent a packet to the server, the server replied an ack. But kernel handled them in different

                       But the kernel handled them in different

> +ways. When kernel receives a tpc packet in the established status,

         When the kernel receives a TCP packet

> +kernel has two paths to handle the packet, one is fast path, another

   the kernel has two paths to handle the packet: one is the fast path and the other

> +is slow path. The comment in kernel code provides a good explanation of

   is the slow path.

> +them; I paste it below::
> +
> +  It is split into a fast path and a slow path. The fast path is
> +  disabled when:
> +  - A zero window was announced from us - zero window probing
> +    is only handled properly on the slow path.
> +  - Out of order segments arrived.
> +  - Urgent data is expected.
> +  - There is no buffer space left
> +  - Unexpected TCP flags/window values/header lengths are received
> +    (detected by checking the TCP header against pred_flags)
> +  - Data is sent in both directions. The fast path only supports pure senders
> +    or pure receivers (this means either the sequence number or the ack
> +    value must stay constant)
> +  - Unexpected TCP option.
> +
> +Kernel will try to use fast path unless any of the above conditions

   The kernel will try to use the fast path

> +are satisfied. If the packets are out of order, kernel will handle

   is satisfied.                            order, the kernel will handle

> +them in slow path, which means the performance might be not very

        in the slow path,

> +good. Kernel would also come into slow path if the "Delayed ack" is

         The kernel             into the slow path

> +used, because when using "Delayed ack", the data is sent in both
> +directions. When the tcp window scale option is not used, kernel will

                        TCP                        not used, the kernel will

> +try to enable fast path immediately when the connection comes into the established
> +state, but if the tcp window scale option is used, kernel will disable

                     TCP                        used, the kernel will

> +the fast path at first, and try to enable it after kerenl receives

                                                after the kernel receives

> +packets. We could use the 'ss' command to verify whether the window
> +scale option is used. e.g. run below command on either server or

                         E.g., run the command below on either the server or the

> +client::
> +
> +  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
> +  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port     
> +  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000     
> +             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
> +
> +The 'wscale:7,7' means both server and client set the window scale
> +option to 7. Now we could explain the nstat output in our test:
> +
> +In the first nstat output of client side, the client sent a packet, server

                             of the client side,                       the server

> +reply an ack, when kernel handled this ack, the fast path was not

   replied with an ack; when the kernel handled this ack,

> +enabled, so the ack was counted into 'TcpExtTCPPureAcks'.
> +In the second nstat output of client side, the client sent a packet again,

                                 the client side,

> +and received another ack from the server, this time, the fast path is

                            from the server; this time,

> +enabled, and the ack was qualified for fast path, so it was handled by
> +the fast path, so this ack was counted into TcpExtTCPHPAcks.
> +In the first nstat output of server side, the fast path was not enabled,

                             of the server side,

> +so there was no 'TcpExtTCPHPHits'.
> +In the second nstat output of server side, the fast path was enabled,

                              of the server side,

> +and the packet received from client qualified for fast path, so it

                           from the client qualified for the fast path,

> +was counted into 'TcpExtTCPHPHits'.
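The conditions in the kernel comment can be condensed into a boolean sketch (purely illustrative; the real check is performed via the pred_flags mechanism):

```python
# Illustrative sketch of the fast-path eligibility rules quoted above.
def fast_path_ok(zero_window_announced, out_of_order, urgent_expected,
                 buffer_full, header_as_predicted, pure_direction,
                 unexpected_option):
    """True when none of the fast-path-disabling conditions hold."""
    return (not zero_window_announced and not out_of_order
            and not urgent_expected and not buffer_full
            and header_as_predicted and pure_direction
            and not unexpected_option)

# A pure ack on an otherwise quiet established connection qualifies:
print(fast_path_ok(False, False, False, False, True, True, False))  # True
```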
> +
> +tcp abort
> +========

TCP abort
=========

> +
> +Some counters indicate the reasons why tcp layer want to send a rst,

                                     why the TCP layer may want to send a RST.

What is "rst"?  or "RST" in RFCs?


> +they are:

   They are:

> +* TcpExtTCPAbortOnData
> +* TcpExtTCPAbortOnClose
> +* TcpExtTCPAbortOnMemory
> +* TcpExtTCPAbortOnTimeout
> +* TcpExtTCPAbortOnLinger
> +* TcpExtTCPAbortFailed
> +
> +TcpExtTCPAbortOnData
> +-------------------
   --------------------
(same length as text above it)

> +
> +It means tcp layer has data in flight, but need to close the

      means the TCP layer                     needs

> +connection. So tcp layer sends a rst to the other side, indicate the

                  the TCP layer                            indicating

> +connection is not closed very graceful. An easy way to increase this

              is not closed gracefully.

> +counter is using the SO_LINGER option. Please refer to the SO_LINGER
> +section of the socket man page:
> +
> +http://man7.org/linux/man-pages/man7/socket.7.html
> +
> +By default, when an application closes a connection, the close function
> +will return immediately and kernel will try to send the in-flight data

                           and the kernel

> +async. If you use the SO_LINGER option, set l_onoff to 1, and l_linger
> +to a positive number, the close function won't return immediately, but
> +wait for the in-flight data are acked by the other side, the max wait

                          data to be acked by the other side;

> +time is l_linger seconds. If set l_onoff to 1 and set l_linger to 0,
> +when the application closes a connection, kernel will send an rst

                                             the kernel will send an RST

> +immediately, and increase the TcpExtTCPAbortOnData counter.
> +
> +We run nc on the server side::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run below python code on the client side::

   Run the below

> +
> +  import socket
> +  import struct
> +    
> +  server = 'nstat-b' # server address
> +  port = 9000
> +    
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
> +  s.connect((server, port))
> +  s.close()
> +
> +On client side, we could see TcpExtTCPAbortOnData increased::

   On the client side,

> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnData            1                  0.0
> +
> +If we capture packet by tcpdump, we could see the client send rst

                 packets                                         RST

> +instead of fin.

              FIN.

> +
> +
> +TcpExtTCPAbortOnClose
> +--------------------

   ---------------------

> +
> +This counter means the tcp layer has unread data when an application

                          TCP

> +want to close a connection.

   wants

> +
> +On the server side, we run below python script::

                          run the below

> +
> +  import socket
> +  import time
> +  
> +  port = 9000
> +  
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +This python script listen on 9000 port, but doesn't read anything from

                      listens on port 9000,

> +the connection.
> +
> +On the client side, we send the string "hello" by nc::
> +
> +  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
> +
> +Then, we come back to the server side, the server has received the "hello"

                                    side;

> +packet, and tcp layer has acked this packet, but the application didn't

           and the TCP layer

> +read it yet. We type Ctrl-C to terminate the server script. Then we
> +could find TcpExtTCPAbortOnClose increased 1 on the server side:
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnClose           1                  0.0
> +
> +If we run tcpdump on the server side, we could find the server sent a
> +rst after we type Ctrl-C.

   RST

> +
> +TcpExtTCPAbortOnMemory
> +--------------------

   ----------------------

> +
> +When an application closes a tcp connection, kernel still need to track

                                TCP             the kernel still needs

> +the connection, let it complete the tcp disconnect process. E.g. an

                                       TCP

> +app calls the close method of a socket, kernel sends fin to the other

                                           the kernel sends FIN

> +side of the connection, then the app has no relationship with the
> +socket any more, but kernel need to keep the socket, this socket

                    but the kernel needs

> +becomes an orphan socket, kernel waits for the reply of the other side,

                             the kernel

> +and would come to the TIME_WAIT state finally. When kernel has no

                                                       the kernel does not have

> +enough memory to keep the orphan socket, kernel would send an rst to

                                            the kernel sends an RST to

> +the other side, and delete the socket, in such situation, kernel will

                       deletes the socket. In such situation, the kernel will

> +increase 1 to the TcpExtTCPAbortOnMemory. Two conditions would trigger
> +TcpExtTCPAbortOnMemory:
> +
> +* the memory used by tcp protocol is higher than the third value of

                     by the TCP
> +the tcp_mem. Please refer to the tcp_mem section in the tcp man page:
> +
> +http://man7.org/linux/man-pages/man7/tcp.7.html
> +
> +* the orphan socket count is higher than net.ipv4.tcp_max_orphans
> +
> +Below is an example which let the orphan socket count be higher than

                             lets

> +net.ipv4.tcp_max_orphans.
> +
> +Change tcp_max_orphans to a smaller value on client::

                                             on the client::

> +
> +  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
> +
> +Client code (create 64 connection to server)::

                       64 connections to the server)::

> +
> +  nstatuser@nstat-a:~$ cat client_orphan.py
> +  import socket
> +  import time
> +  
> +  server = 'nstat-b' # server address
> +  port = 9000
> +  
> +  count = 64
> +  
> +  connection_list = []
> +  
> +  for i in range(64):
> +      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +      s.connect((server, port))
> +      connection_list.append(s)
> +      print("connection_count: %d" % len(connection_list))
> +  
> +  while True:
> +      time.sleep(99999)
> +
> +Server code (accept 64 connection from client)::

                       64 connections from the client)::

> +
> +  nstatuser@nstat-b:~$ cat server_orphan.py
> +  import socket
> +  import time
> +  
> +  port = 9000
> +  count = 64
> +  
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(count)
> +  connection_list = []
> +  while True:
> +      sock, addr = s.accept()
> +      connection_list.append((sock, addr))
> +      print("connection_count: %d" % len(connection_list))
> +
> +Run the python scripts on server and client.
> +
> +On server::
> +
> +  python3 server_orphan.py
> +
> +On client::
> +
> +  python3 client_orphan.py
> +
> +Run iptables on server::
> +
> +  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
> +
> +Type Ctrl-C on client, stop client_orphan.py.

                          stopping

> +
> +Check TcpExtTCPAbortOnMemory on client::

                                on the client::

> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnMemory          54                 0.0
> +
> +Check orphane socket count on client::

         orphaned socket count on the client::

> +
> +  nstatuser@nstat-a:~$ ss -s
> +  Total: 131 (kernel 0)
> +  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
> +  
> +  Transport Total     IP        IPv6
> +  *         0         -         -
> +  RAW       1         0         1
> +  UDP       1         1         0
> +  TCP       14        13        1
> +  INET      16        14        2
> +  FRAG      0         0         0
> +
> +The explanation of the test: after run server_orphan.py and

                                      running

> +client_orphan.py, we set up 64 connections between server and
> +client. Run the iptables command, the server will drop all packets from

                            command. The server

> +the client, type Ctrl-C on client_orphan.py, the system of the client

       client. Type

> +would try to close these connections, and before they are closed
> +gracefully, these connections became orphan sockets. As the iptables
> +of the server blocked packets from the client, the server won't receive fin

                                                                           FIN

> +from the client, so all connections on the client would be stuck in FIN_WAIT_1
> +stage, so they will keep as orphan sockets until timeout. We have echo

                  will be kept as                            We have echo-ed

> +10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system would
> +only keep 10 orphan sockets, for all other orphan sockets, the client

                       sockets;

> +system sent rst for them and delete them. We have 64 connections, so

               RST          and deleted them.

> +the 'ss -s' command shows the system has 10 orphan sockets, and the
> +value of TcpExtTCPAbortOnMemory was 54.
> +
> +An additional explanation about orphan socket count: You could find the
> +exactly orphan socket count by the 'ss -s' command, but when kernel

   exact                                               but when the kernel

> +decide whither increases TcpExtTCPAbortOnMemory and sends rst, kernel

   decides whether to increase                               RST, the kernel

> +doesn't always check the exactly orphan socket count. For increasing

                            exact

> +performance, kernel checks an approximate count firstly, if the

                the kernel                         first; if the

> +approximate count is more than tcp_max_orphans, kernel checks the

                                                   the kernel checks the

> +exact count again. So if the approximate count is less than
> +tcp_max_orphans, but exactly count is more than tcp_max_orphans, you

                    but the exact count

> +would find TcpExtTCPAbortOnMemory is not increased at all. If
> +tcp_max_orphans is large enough, it won't occur, but if you decrease
> +tcp_max_orphans to a small value like our test, you might find this
> +issue. So in our test, the client set up 64 connections although the

                                                           although [drop "the"]

> +tcp_max_orphans is 10. If the client only set up 11 connections, we
> +can't find the change of TcpExtTCPAbortOnMemory.
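The approximate-then-exact check described above can be sketched as follows (a simplification for illustration; the kernel uses a percpu counter for the approximation):

```python
# Simplified sketch: consult the cheap approximate orphan count first and
# take the expensive exact count only when the approximation exceeds the
# limit -- so a too-low approximation can hide an over-limit exact count.
def too_many_orphans(approx_count, exact_count_fn, max_orphans):
    if approx_count <= max_orphans:
        return False  # the exact count is never consulted
    return exact_count_fn() > max_orphans

# Approximation under the limit: no abort even if the exact count is high.
print(too_many_orphans(5, lambda: 64, 10))   # False
# Approximation over the limit, and the exact count is over it too: abort.
print(too_many_orphans(20, lambda: 64, 10))  # True
```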
> +
> +TcpExtTCPAbortOnTimeout
> +----------------------

   -----------------------

> +This counter will increase when any of the tcp timers expire. In this

                                              TCP

> +situation, kernel won't send rst, just give up the connection.

              the kernel        RST; it just gives up the connection.

> +Continue the previous test, we wait for several minutes, because the

   Continuing

> +iptables on the server blocked the traffic, the server wouldn't receive
> +fin, and all the client's orphan sockets would timeout on the

   FIN,

> +FIN_WAIT_1 state finally. So we wait for a few minutes, we could find

                             So if we wait for a few minutes, we can find

> +10 timeout on the client::

   10 timeouts

> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnTimeout         10                 0.0
> +
> +TcpExtTCPAbortOnLinger
> +---------------------

   ----------------------

> +When a tcp connection comes into FIN_WAIT_2 state, instead of waiting

          TCP

> +for the fin packet from the other side, kernel could send a rst and

           FIN                             the kernel        an RST

> +delete the socket immediately. This is not the default behavior of
> +linux kernel tcp stack, but after configuring socket option, you could

   the Linux kernel TCP stack,

> +let kernel follow this behavior. Below is an example.

   find the kernel following this behavior.

> +
> +The server side code::
> +
> +  nstatuser@nstat-b:~$ cat server_linger.py
> +  import socket
> +  import time
> +  
> +  port = 9000
> +  
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +The client side code::
> +
> +  nstatuser@nstat-a:~$ cat client_linger.py 
> +  import socket
> +  import struct
> +  
> +  server = 'nstat-b' # server address
> +  port = 9000
> +  
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
> +  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))                   
> +  s.connect((server, port))
> +  s.close()
> +
> +Run server_linger.py on server::
> +
> +  nstatuser@nstat-b:~$ python3 server_linger.py 
> +
> +Run client_linger.py on client::
> +
> +  nstatuser@nstat-a:~$ python3 client_linger.py
> +
> +After run client_linger.py, check the output of nstat::

         running

> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnLinger          1                  0.0
> +
> +TcpExtTCPAbortFailed
> +-------------------

   --------------------

> +The kernel tcp layer will send rst if section 2.17 of RFC 2525 is satisfied:

              TCP                 RST

> +
> +https://tools.ietf.org/html/rfc2525#page-50
> +
> +If an internal error occurs during this process, TcpExtTCPAbortFailed
> +will be increased.
> +
> +TcpExtListenOverflows and TcpExtListenDrops
> +========================================

   ===========================================

> +When the kernel receives a SYN from a client, and if the TCP accept
> +queue is full, the kernel will drop the SYN and add 1 to
> +TcpExtListenOverflows. At the same time the kernel will also add 1
> +to TcpExtListenDrops. When a TCP socket is in LISTEN state, and the
> +kernel needs to drop a packet, the kernel would always add 1 to
> +TcpExtListenDrops. So an increase of TcpExtListenOverflows would let
> +TcpExtListenDrops increase at the same time, but TcpExtListenDrops
> +would also increase without TcpExtListenOverflows increasing, e.g. a
> +memory allocation failure would also let TcpExtListenDrops increase.
> +
> +Note: The above explanation is based on kernel version 4.15 or
> +above; on an older kernel, the TCP stack has different behavior when
> +the TCP accept queue is full. On the older kernel, the TCP stack
> +won't drop the SYN; it would complete the 3-way handshake, but as
> +the accept queue is full, the TCP stack will keep the socket in the
> +TCP half-open queue. As it is in the half-open queue, the TCP stack
> +will send SYN+ACK on an exponential backoff timer. After the client
> +replies with ACK, the TCP stack checks whether the accept queue is
> +still full. If it is not full, move the socket to the accept queue.
> +If it is full, keep the socket in the half-open queue. The next time
> +the client replies with ACK, this socket will get another chance to
> +move to the accept queue.
> +
> +Here is an example:
> +
> +On the server, run the nc command, listening on port 9000::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On the client, run 3 nc commands in different terminals::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +The nc command only accepts 1 connection, and the accept queue
> +length is 1. In the current Linux implementation, setting the queue
> +length to n means the actual queue length is n+1. Now we create 3
> +connections: 1 is accepted by nc, 2 are in the accept queue, so the
> +accept queue is full.
> +
> +Before running the 4th nc, we clean the nstat history on the server::
> +
> +  nstatuser@nstat-b:~$ nstat -n
> +
> +Run the 4th nc on the client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +
> +If the nc server is running on kernel 4.15 or a higher version, you
> +won't see the "Connection to ... succeeded!" string, because the
> +kernel will drop the SYN if the accept queue is full. If the nc
> +server is running on an older kernel, you would see that the
> +connection succeeded, because the kernel would complete the 3-way
> +handshake and keep the socket in the half-open queue.
> +
> +Our test is on kernel 4.15. Run nstat on the server::

> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    4                  0.0
> +  IpInDelivers                    4                  0.0
> +  TcpInSegs                       4                  0.0
> +  TcpExtListenOverflows           4                  0.0
> +  TcpExtListenDrops               4                  0.0
> +  IpExtInOctets                   240                0.0
> +  IpExtInNoECTPkts                4                  0.0
> +
> +We can see both TcpExtListenOverflows and TcpExtListenDrops are 4.
> +If the time between the 4th nc and the nstat is longer, the values
> +of TcpExtListenOverflows and TcpExtListenDrops will be larger. The
> +SYN of the 4th nc is dropped, so the client keeps retrying. Every
> +retried SYN is dropped again and increases both counters.

-- 
~Randy
