All of lore.kernel.org
 help / color / mirror / Atom feed
* UDP multicast packet loss not reported if TX ring overrun?
@ 2009-08-17 20:01 Christoph Lameter
  2009-08-17 20:40 ` Nivedita Singhvi
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-17 20:01 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

If I use a large send queue

(echo 2000000 >/proc/sys/net/core/wmem_default)

then lots of packet loss results if I try to send more packets than
the wire can carry (at 300 bytes per message the maximum rate is 341k pps).
With a size of 2M for the output buffer the TX ring is overrun.

But this loss is nowhere seen in any counter increments:

Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111

TotalMsg   Lost SeqErr Msg/Sec   Min/us  Avg/us  Max/us StdDev  Kbytes Idle  Samples
 3118667 532979  25341  308774 1734.96 1784.72 1904.04   44.11     0.0   0       10
 3415674 585146  27642  341569 1788.90 1858.40 1905.43   34.32 102470.4    0       10
 3449840 591012  27844  341568 1941.12 1987.45 2040.16   32.73 102470.4    0        9
 3449819 591093  27993  341567 2024.34 2036.00 2044.24    6.48 102470.0    0        5
 3415693 585268  27633  341568 2010.57 2017.84 2025.10    7.27 102470.4    0        2

ifconfig
eth0      Link encap:Ethernet  HWaddr 00:21:9b:8f:a1:40
          inet addr:10.2.36.110  Bcast:10.2.36.255  Mask:255.255.255.0
          inet6 addr: fe80::221:9bff:fe8f:a140/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:174716487 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1379 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:60451720777 (56.3 GiB)  TX bytes:225841 (220.5 KiB)
          Interrupt:36 Memory:d6000000-d6012800


If I then reduce the queue size to 20k

(echo 20000 >/proc/sys/net/core/wmem_default)

the loss either does not occur or I see a corresponding loss in the switch
stats:

clameter@rd-strategy3:~$ bin/mcast
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111

TotalMsg   Lost SeqErr Msg/Sec   Min/us  Avg/us  Max/us StdDev  Kbytes Idle  Samples
 3126997      0      0  309598  109.57  191.69  404.73   79.55     0.0    0       10
 3449790      5      5  341566  238.60  324.98  407.74   57.31 102469.6    0        9
 3449843     94     94  341569  412.31  412.32  412.33    0.01 102470.4    0        2
 3449859     76     76  341569  379.78  399.74  419.65   16.28 102470.6    0        3
 3415644     92     92  341569  408.63  414.63  417.93    4.25 102470.5    0        3
 3413885      3      3  341386   93.53  145.58  207.28   36.85 102415.8    0       10
 3449832      0      0  341568  207.06  282.44  348.13   45.80 102470.4    0       10


The TX ring is not overrun in that case since the queue size limits the
maximum number of objects in flight to below 255 (the TX ring size of the
Broadcom NIC). So the application is throttled.


It seems that Linux does not report the UDP packet loss above that is due
to overrunning the TX ring. Why is that?




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 20:01 UDP multicast packet loss not reported if TX ring overrun? Christoph Lameter
@ 2009-08-17 20:40 ` Nivedita Singhvi
  2009-08-17 20:46   ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Nivedita Singhvi @ 2009-08-17 20:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: netdev, Eric Dumazet

Christoph Lameter wrote:
> If I use a large send queue
> 
> (echo 2000000 >/proc/sys/net/core/wmem_default)
> 
> then lots of packet loss results if I try to send more packets than
> the bandwidth of the wire (300 bytes the maximum rate is 341k pps). With a
> size of 2M for the output buffer the TX ring is overrun.
> 
> But this loss is nowhere seen in any counter increments:

Is there any chance this is getting dropped at qdisc?
Can you check with tc?

thanks,
Nivedita

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 20:40 ` Nivedita Singhvi
@ 2009-08-17 20:46   ` Christoph Lameter
  2009-08-17 21:50     ` Sridhar Samudrala
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-17 20:46 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: netdev, Eric Dumazet

On Mon, 17 Aug 2009, Nivedita Singhvi wrote:

> Is there any chance this is getting dropped at qdisc?
> Can you check with tc?

I checked and the counter there is zero.

:/home/clameter# tc -s qdisc show

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1
1 1 1 1 1
 Sent 3042712708 bytes 8896862 pkt (dropped 0, overlimits 0 requeues 56303)
 rate 0bit 0pps backlog 0b 0p requeues 56303





^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 20:46   ` Christoph Lameter
@ 2009-08-17 21:50     ` Sridhar Samudrala
  2009-08-17 22:13       ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-17 21:50 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 2009-08-17 at 16:46 -0400, Christoph Lameter wrote:
> On Mon, 17 Aug 2009, Nivedita Singhvi wrote:
> 
> > Is there any chance this is getting dropped at qdisc?
> > Can you check with tc?
> 
> I checked and the counter there is zero.
> 
> :/home/clameter# tc -s qdisc show
> 
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1
> 1 1 1 1 1
>  Sent 3042712708 bytes 8896862 pkt (dropped 0, overlimits 0 requeues 56303)
>  rate 0bit 0pps backlog 0b 0p requeues 56303

What about ethtool -S ? Does it report any errors?

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 21:50     ` Sridhar Samudrala
@ 2009-08-17 22:13       ` Christoph Lameter
  2009-08-17 22:43         ` Sridhar Samudrala
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-17 22:13 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 17 Aug 2009, Sridhar Samudrala wrote:

> What about ethtool -S ? Does it report any errors?

Neither. This is a Broadcom bnx2 NIC.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 22:13       ` Christoph Lameter
@ 2009-08-17 22:43         ` Sridhar Samudrala
  2009-08-17 22:52           ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-17 22:43 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 2009-08-17 at 18:13 -0400, Christoph Lameter wrote:
> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> 
> > What about ethtool -S ? Does it report any errors?
> 
> Neither. This is is a broadcom bnx2 NIC.

Are you sure the packets are dropped at the sender?
Another place where packet drops/errors are counted is
    /proc/net/softnet_stat
It tracks some counters that could result in drops. I thought these were
all receive statistics, but it looks like cpu_collision is a tx stat: the
structure is named netif_rx_stat, yet it includes a cpu_collision
counter.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 22:43         ` Sridhar Samudrala
@ 2009-08-17 22:52           ` Christoph Lameter
  2009-08-17 22:57             ` Christoph Lameter
                               ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-17 22:52 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 17 Aug 2009, Sridhar Samudrala wrote:

> On Mon, 2009-08-17 at 18:13 -0400, Christoph Lameter wrote:
> > On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> >
> > > What about ethtool -S ? Does it report any errors?
> >
> > Neither. This is is a broadcom bnx2 NIC.
>
> Are you sure the packets are dropped at the sender?

Yes. I am sending 400k messages from the app and the receiver only gets
341k @ 300 bytes (which is the line rate). There is no way that the 400k get
over the line. Also, if I reduce SO_SNDBUF then both receiver and
sender drop to 341k.
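As a sanity check on the 341k figure: a 300-byte UDP payload also carries UDP/IP/Ethernet headers plus preamble and inter-frame gap on the wire, which on gigabit Ethernet works out to almost exactly the observed ceiling. A quick sketch (standard header sizes; treat the calculation itself as illustrative):

```c
/* Theoretical packets-per-second for a given UDP payload on GigE. */
static double gige_line_rate_pps(int udp_payload)
{
    const int udp_ip_eth  = 8 + 20 + 14 + 4; /* UDP + IPv4 + Ethernet + FCS */
    const int preamble_ifg = 8 + 12;         /* preamble/SFD + inter-frame gap */
    double wire_bytes = udp_payload + udp_ip_eth + preamble_ifg;
    return 1e9 / 8.0 / wire_bytes;           /* 1 Gbit/s expressed in bytes/s */
}
```

gige_line_rate_pps(300) evaluates to roughly 341,530 pps, matching the measured 341k plateau.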

I added the output of ethtool -S at the end.

The mcast tool can be had from http://gentwo.org/ll or from my directory
on www.kernel.org.

> Another place where packet drops/errors are counted is
>     /proc/net/softnet_stat
> It tracks some counters that could result in drops. I thought these are
> all receive statistics. But looks like cpu_collision is a tx stat. The
> name of the structure is netif_rx_stat and it includes cpu_collison
> counter.

How do I decode that information?


clameter@rd-gateway3:~$ cat /proc/net/softnet_stat
00000a2f 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

rd-gateway3:/home/clameter# ethtool -S eth0
NIC statistics:
     rx_bytes: 380526
     rx_error_bytes: 0
     tx_bytes: 6299438984
     tx_error_bytes: 0
     rx_ucast_packets: 2373
     rx_mcast_packets: 168
     rx_bcast_packets: 5
     tx_ucast_packets: 1972
     tx_mcast_packets: 18205522
     tx_bcast_packets: 0
     tx_mac_errors: 0
     tx_carrier_errors: 0
     rx_crc_errors: 0
     rx_align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     tx_deferred: 0
     tx_excess_collisions: 0
     tx_late_collisions: 0
     tx_total_collisions: 0
     rx_fragments: 0
     rx_jabbers: 0
     rx_undersize_packets: 0
     rx_oversize_packets: 0
     rx_64_byte_packets: 375
     rx_65_to_127_byte_packets: 1857
     rx_128_to_255_byte_packets: 168
     rx_256_to_511_byte_packets: 39
     rx_512_to_1023_byte_packets: 12
     rx_1024_to_1522_byte_packets: 95
     rx_1523_to_9022_byte_packets: 0
     tx_64_byte_packets: 221
     tx_65_to_127_byte_packets: 1413
     tx_128_to_255_byte_packets: 240
     tx_256_to_511_byte_packets: 18205519
     tx_512_to_1023_byte_packets: 18
     tx_1024_to_1522_byte_packets: 83
     tx_1523_to_9022_byte_packets: 0
     rx_xon_frames: 0
     rx_xoff_frames: 0
     tx_xon_frames: 0
     tx_xoff_frames: 0
     rx_mac_ctrl_frames: 0
     rx_filtered_packets: 16002
     rx_discards: 0
     rx_fw_discards: 0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 22:52           ` Christoph Lameter
@ 2009-08-17 22:57             ` Christoph Lameter
  2009-08-18  0:12             ` Sridhar Samudrala
  2009-08-24 23:14             ` Eric Dumazet
  2 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-17 22:57 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

Argh, I guess you want the receiver. The last message was from the sender.

clameter@rd-strategy3:~$ cat /proc/net/softnet_stat
0115d345 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 22:52           ` Christoph Lameter
  2009-08-17 22:57             ` Christoph Lameter
@ 2009-08-18  0:12             ` Sridhar Samudrala
  2009-08-18  0:25               ` Christoph Lameter
  2009-08-24 17:40               ` Christoph Lameter
  2009-08-24 23:14             ` Eric Dumazet
  2 siblings, 2 replies; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-18  0:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 2009-08-17 at 18:52 -0400, Christoph Lameter wrote:
> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> 
> > On Mon, 2009-08-17 at 18:13 -0400, Christoph Lameter wrote:
> > > On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> > >
> > > > What about ethtool -S ? Does it report any errors?
> > >
> > > Neither. This is is a broadcom bnx2 NIC.
> >
> > Are you sure the packets are dropped at the sender?
> 
> Yes I am sending 400k messages from the app and the receiver only gets
> 341k @300 byte (which is the line rate). There is no way that the 400k get
> over the line. Also if I reduce SO_SNDBUF then both receiver and
> sender get down to 341k.
> 
> I added the output of ethtool -S at the end.
> 
> The mcast tool can be had from http://gentwo.org/ll or from my directory
> on www.kernel.org.
> 
> > Another place where packet drops/errors are counted is
> >     /proc/net/softnet_stat
> > It tracks some counters that could result in drops. I thought these are
> > all receive statistics. But looks like cpu_collision is a tx stat. The
> > name of the structure is netif_rx_stat and it includes cpu_collison
> > counter.
> 
> How do I decode that information?
  total dropped time_squeeze 0 0 0 0 0 cpu_collision

The first 3 are rx stats and the last one is a tx stat.
Anyway, only the first field (total packets received) seems to be non-zero
in your softnet_stat output on both sender and receiver.
So it is possible that there is some other place in the stack where the
packets are getting dropped but not counted.
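Mechanically, the fields of each per-CPU row can be pulled out like this (a sketch; the field layout follows the 2.6-era netif_rx_stat structure discussed above, and later kernels changed the set of fields):

```c
#include <stdio.h>

/* Parse the leading fields of one /proc/net/softnet_stat row.
 * Fields are hex; 2.6-era layout: total, dropped, time_squeeze,
 * five zero padding words, cpu_collision. */
static int parse_softnet_row(const char *line, unsigned *total,
                             unsigned *dropped, unsigned *squeeze)
{
    return sscanf(line, "%x %x %x", total, dropped, squeeze) == 3;
}
```

Running it over the output above confirms only the first (total received) field is ever non-zero.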

-Sridhar



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-18  0:12             ` Sridhar Samudrala
@ 2009-08-18  0:25               ` Christoph Lameter
  2009-08-24 17:40               ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-18  0:25 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 17 Aug 2009, Sridhar Samudrala wrote:

> The first 3 are rx stats and the last one is a tx stat.
> Anyway, only the first field(total packets received) seems to be non-zero
> in your softnet_stat output on both sender and receiver.
> So it is possible that there is some other place in the stack where the packets
> are gettting dropped but not counted.

Is the driver responsible? I noticed no rx_drop++ in there.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-18  0:12             ` Sridhar Samudrala
  2009-08-18  0:25               ` Christoph Lameter
@ 2009-08-24 17:40               ` Christoph Lameter
  2009-08-24 22:02                 ` Eric Dumazet
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-24 17:40 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Nivedita Singhvi, netdev, Eric Dumazet

On Mon, 17 Aug 2009, Sridhar Samudrala wrote:

> So it is possible that there is some other place in the stack where the packets
> are gettting dropped but not counted.

One such place is ip_push_pending_frames():

        /* Netfilter gets whole the not fragmented skb. */
        err = ip_local_out(skb);
        if (err) {
                if (err > 0)
                        err = inet->recverr ? net_xmit_errno(err) : 0;
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                if (err)
                        goto error;
        }

out:
        ip_cork_release(inet);
        return err;

error:
        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
        goto out;


So if ip_local_out returns NET_XMIT_DROP then it's simply going to be
replaced by 0. Then we check err again and there is no error!

The statistics are only generated if IP_RECVERR is set.
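(As an aside: an application that does want to see these errors can enable IP_RECVERR and drain the socket error queue with MSG_ERRQUEUE. A minimal userspace sketch, not part of the mcast tool:)

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Opt in to receiving transmit errors on a UDP socket. With this set,
 * net_xmit_errno() results propagate to the sender (as -ENOBUFS), and
 * detailed reports can be read with recvmsg(fd, ..., MSG_ERRQUEUE). */
static int enable_recverr(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```

Without this option, a blast of datagrams into an overrun TX ring simply returns success from sendto().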

Could we move the increment of IPSTATS_MIB_OUTDISCARDS up so that it
is incremented regardless of the setting of IP_RECVERR?

For example:


Subject: Report TX drops

Incrementing of TX drop counters currently does not work if errors from the
network stack are suppressed (IP_RECVERR off). Increment the statistics
independently of the setting of IP_RECVERR.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 net/ipv4/ip_output.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Index: linux-2.6/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_output.c	2009-08-24 17:04:27.000000000 +0000
+++ linux-2.6/net/ipv4/ip_output.c	2009-08-24 17:32:05.000000000 +0000
@@ -1300,20 +1300,21 @@ int ip_push_pending_frames(struct sock *

 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
-	if (err) {
-		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
-		if (err)
-			goto error;
+	if (err > 0) {
+		/* The packet was dropped by the network subsystem */
+		IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+
+		/*
+		 * Errors are not passed on if the socket
+		 * does not process errors (see IP_RECVERR).
+		 * net_xmit_errno filters NET_XMIT_CN.
+		 */
+		err = inet->recverr ? net_xmit_errno(err) : 0;
 	}

 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }

 /*




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 17:40               ` Christoph Lameter
@ 2009-08-24 22:02                 ` Eric Dumazet
  2009-08-24 22:36                   ` Sridhar Samudrala
  2009-08-25 13:46                   ` Christoph Lameter
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-08-24 22:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> 
>> So it is possible that there is some other place in the stack where the packets
>> are gettting dropped but not counted.
> 
> Such a deed occurs in ip_push_pending_frames():
> 
>         /* Netfilter gets whole the not fragmented skb. */
>         err = ip_local_out(skb);
>         if (err) {
>                 if (err > 0)
>                         err = inet->recverr ? net_xmit_errno(err) : 0;
> 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>                 if (err)
>                         goto error;
>         }
> 
> out:
>         ip_cork_release(inet);
>         return err;
> 
> error:
>         IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>         goto out;
> 
> 
> So if ip_local_out returns NET_XMIT_DROP then its simply going to be
> replaced by 0. Then we check err again and there is no error!!!!
> 
> The statistics are only generated if IP_RECVERR is set.
> 
> Could we move the increment of IPSTATS_MIB_OUTDISCARDS up so that it
> is incremented regardless of the setting of IP_RECVERR?
> 
> F.e?
> 
> 
> Subject: Report TX drops
> 
> Incrementing of TX drop counters currently does not work if errors from the
> network stack are suppressed (IP_RECVERR off). Increment the statistics
> independently of the setting of IP_RECVERR.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  net/ipv4/ip_output.c |   19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6/net/ipv4/ip_output.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/ip_output.c	2009-08-24 17:04:27.000000000 +0000
> +++ linux-2.6/net/ipv4/ip_output.c	2009-08-24 17:32:05.000000000 +0000
> @@ -1300,20 +1300,21 @@ int ip_push_pending_frames(struct sock *
> 
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
> -	if (err) {
> -		if (err > 0)
> -			err = inet->recverr ? net_xmit_errno(err) : 0;
> -		if (err)
> -			goto error;
> +	if (err > 0) {
> +		/* The packet was dropped by the network subsystem */
> +		IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +
> +		/*
> +		 * Errors are not passed on if the socket
> +		 * does not process errors (see IP_RECVERR).
> +		 * net_xmit_errno filters NET_XMIT_CN.
> +		 */
> +		err = inet->recverr ? net_xmit_errno(err) : 0;
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*
> 
> 
> 
> 

NET_XMIT_CN strikes again :)

Well, if ip_local_out() returns a negative error (say -EPERM, for example),
your patch disables the OUTDISCARDS increment.
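For reference, the verdict codes and mapping at issue, as they stood in 2.6-era include/linux/netdevice.h (reproduced here as a userspace sketch from memory; verify against your tree):

```c
#include <errno.h>

/* qdisc/device transmit verdicts: small positive codes, not errnos. */
#define NET_XMIT_SUCCESS 0
#define NET_XMIT_DROP    1   /* skb was dropped */
#define NET_XMIT_CN      2   /* skb queued, but congestion was noted */

/* net_xmit_errno() equivalent: map a positive verdict to an errno for
 * the socket layer. NET_XMIT_CN means the packet itself went out, so
 * it maps to 0; only meaningful for positive verdicts. */
static int xmit_errno(int verdict)
{
    return verdict != NET_XMIT_CN ? -ENOBUFS : 0;
}
```

This is why testing `err != NET_XMIT_CN` catches both negative errors and NET_XMIT_DROP while sparing the congestion-notification case.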

Maybe a simpler patch like this one?

[PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()

ip_push_pending_frames() can fail to send a frame because of a congested
device. In this case, we increment SNMP OUTDISCARDS only if the user set
IP_RECVERR, which is not RFC conformant.

The only case where we should not update OUTDISCARDS is when the
ip_local_out() return value is NET_XMIT_CN (meaning the
skb was transmitted but future frames might be dropped).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..27a5b79 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1301,19 +1301,15 @@ int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
+		if (err != NET_XMIT_CN)
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 		if (err > 0)
 			err = inet->recverr ? net_xmit_errno(err) : 0;
-		if (err)
-			goto error;
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 22:02                 ` Eric Dumazet
@ 2009-08-24 22:36                   ` Sridhar Samudrala
  2009-08-25 13:48                     ` Christoph Lameter
  2009-08-25 13:46                   ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-24 22:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, Nivedita Singhvi, netdev, David S. Miller

On Tue, 2009-08-25 at 00:02 +0200, Eric Dumazet wrote:
> Christoph Lameter wrote:
> > On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> > 
> >> So it is possible that there is some other place in the stack where the packets
> >> are gettting dropped but not counted.
> > 
> > Such a deed occurs in ip_push_pending_frames():
> > 
> >         /* Netfilter gets whole the not fragmented skb. */
> >         err = ip_local_out(skb);
> >         if (err) {
> >                 if (err > 0)
> >                         err = inet->recverr ? net_xmit_errno(err) : 0;
> > 			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >                 if (err)
> >                         goto error;
> >         }
> > 
> > out:
> >         ip_cork_release(inet);
> >         return err;
> > 
> > error:
> >         IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> >         goto out;
> > 
> > 
> > So if ip_local_out returns NET_XMIT_DROP then its simply going to be
> > replaced by 0. Then we check err again and there is no error!!!!

Christoph,

So are you hitting this case with your workload, and does this account for all
the packet losses you are seeing?

If we are dropping the packet and returning NET_XMIT_DROP, should
we also increment the qdisc drop stats (sch->qstats.drops)?

In dev_queue_xmit(), we have

        if (q->enqueue) {
                spinlock_t *root_lock = qdisc_lock(q);

                spin_lock(root_lock);

                if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                        kfree_skb(skb);
                        rc = NET_XMIT_DROP;
                } else {
                        rc = qdisc_enqueue_root(skb, q);
                        qdisc_run(q);
                }
                spin_unlock(root_lock);

                goto out;
        }

Here, if QDISC_STATE_DEACTIVATED is true, the skb is dropped and NET_XMIT_DROP
is returned, but this is not accounted in the qdisc drop stats.
However, the counter is incremented when NET_XMIT_DROP is returned via qdisc_drop().

If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?

Thanks
Sridhar

> > 
> NET_XMIT_CN strikes again :)
> 
> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>  your patch disables OUTDISCARDS increments.
> 
> Maybe a simpler patch like this one ?
> 
> [PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()
> 
> ip_push_pending_frames() can fail to send a frame because of a congestioned
> device. In this case, we increment SNMP OUTDISCARDS only if user set
> IP_RECVERR, which is not RFC conformant.
> 
> Only case where we should not update OUTDISCARDS is when
> ip_local_output() return value is NET_XMIT_CN (meaning
> skb was xmitted but future frames might be dropped)
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7d08210..27a5b79 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1301,19 +1301,15 @@ int ip_push_pending_frames(struct sock *sk)
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
>  	if (err) {
> +		if (err != NET_XMIT_CN)
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>  		if (err > 0)
>  			err = inet->recverr ? net_xmit_errno(err) : 0;
> -		if (err)
> -			goto error;
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-17 22:52           ` Christoph Lameter
  2009-08-17 22:57             ` Christoph Lameter
  2009-08-18  0:12             ` Sridhar Samudrala
@ 2009-08-24 23:14             ` Eric Dumazet
  2009-08-25  6:46               ` Eric Dumazet
  2009-08-25 13:45               ` Christoph Lameter
  2 siblings, 2 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-08-24 23:14 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev

Christoph Lameter wrote:
> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
> 
>> On Mon, 2009-08-17 at 18:13 -0400, Christoph Lameter wrote:
>>> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
>>>
>>>> What about ethtool -S ? Does it report any errors?
>>> Neither. This is is a broadcom bnx2 NIC.
>> Are you sure the packets are dropped at the sender?
> 
> Yes I am sending 400k messages from the app and the receiver only gets
> 341k @300 byte (which is the line rate). There is no way that the 400k get
> over the line. Also if I reduce SO_SNDBUF then both receiver and
> sender get down to 341k.
> 
> I added the output of ethtool -S at the end.
> 
> The mcast tool can be had from http://gentwo.org/ll or from my directory
> on www.kernel.org.
> 

# gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.4.1/configure --enable-languages=c,c++ --prefix=/usr
Thread model: posix
gcc version 4.4.1 (GCC)
# pwd
/opt/src/lldiag-0.14
# make
gcc -Wall -omcast mcast.c -lrt -lm
mcast.c: In function ‘set_ip’:
mcast.c:121: warning: implicit declaration of function ‘htons’
mcast.c: In function ‘build_pattern_array’:
mcast.c:168: warning: implicit declaration of function ‘htonl’
/tmp/cc4sYCDr.o: In function `lock':
mcast.c:(.text+0xcad): undefined reference to `__sync_fetch_and_add_4'
collect2: ld returned 1 exit status
make: *** [mcast] Error 1


I have no idea where __sync_fetch_and_add is defined

Thanks

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 23:14             ` Eric Dumazet
@ 2009-08-25  6:46               ` Eric Dumazet
  2009-08-25 13:45               ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25  6:46 UTC (permalink / raw)
  Cc: Christoph Lameter, Sridhar Samudrala, Nivedita Singhvi, netdev

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
>>
>>> On Mon, 2009-08-17 at 18:13 -0400, Christoph Lameter wrote:
>>>> On Mon, 17 Aug 2009, Sridhar Samudrala wrote:
>>>>
>>>>> What about ethtool -S ? Does it report any errors?
>>>> Neither. This is is a broadcom bnx2 NIC.
>>> Are you sure the packets are dropped at the sender?
>> Yes I am sending 400k messages from the app and the receiver only gets
>> 341k @300 byte (which is the line rate). There is no way that the 400k get
>> over the line. Also if I reduce SO_SNDBUF then both receiver and
>> sender get down to 341k.
>>
>> I added the output of ethtool -S at the end.
>>
>> The mcast tool can be had from http://gentwo.org/ll or from my directory
>> on www.kernel.org.
>>
> 
> # gcc -v
> Using built-in specs.
> Target: i686-pc-linux-gnu
> Configured with: ../gcc-4.4.1/configure --enable-languages=c,c++ --prefix=/usr
> Thread model: posix
> gcc version 4.4.1 (GCC)
> # pwd
> /opt/src/lldiag-0.14
> # make
> gcc -Wall -omcast mcast.c -lrt -lm
> mcast.c: In function ‘set_ip’:
> mcast.c:121: warning: implicit declaration of function ‘htons’
> mcast.c: In function ‘build_pattern_array’:
> mcast.c:168: warning: implicit declaration of function ‘htonl’
> /tmp/cc4sYCDr.o: In function `lock':
> mcast.c:(.text+0xcad): undefined reference to `__sync_fetch_and_add_4'
> collect2: ld returned 1 exit status
> make: *** [mcast] Error 1
> 
> 
> I have no idea where __sync_fetch_and_add is defined.
> 

Doh...

Don't bother, I had to add "-march=i686" to CFLAGS to get "xadd" support.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 23:14             ` Eric Dumazet
  2009-08-25  6:46               ` Eric Dumazet
@ 2009-08-25 13:45               ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 13:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Thread model: posix
> gcc version 4.4.1 (GCC)
> # pwd
> /opt/src/lldiag-0.14
> # make
> gcc -Wall -omcast mcast.c -lrt -lm
> mcast.c: In function ‘set_ip’:
> mcast.c:121: warning: implicit declaration of function ‘htons’
> mcast.c: In function ‘build_pattern_array’:
> mcast.c:168: warning: implicit declaration of function ‘htonl’
> /tmp/cc4sYCDr.o: In function `lock':
> mcast.c:(.text+0xcad): undefined reference to `__sync_fetch_and_add_4'
> collect2: ld returned 1 exit status
> make: *** [mcast] Error 1
>
>
> I have no idea where __sync_fetch_and_add is defined.

It's a GCC builtin for x86, sigh. OK, I will put out 0.15 that disables
use of the builtins by default.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 22:02                 ` Eric Dumazet
  2009-08-24 22:36                   ` Sridhar Samudrala
@ 2009-08-25 13:46                   ` Christoph Lameter
  2009-08-25 13:48                     ` Eric Dumazet
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 13:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> NET_XMIT_CN strikes again :)
>
> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>  your patch disables OUTDISCARDS increments.
>
> Maybe a simpler patch like this one ?

Yes looks good. Can we get this in soon?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 13:46                   ` Christoph Lameter
@ 2009-08-25 13:48                     ` Eric Dumazet
  2009-08-25 14:00                       ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 13:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> NET_XMIT_CN strikes again :)
>>
>> Well, if ip_local_out() returns a negative error (say -EPERM for example),
>>  your patch disables OUTDISCARDS increments.
>>
>> Maybe a simpler patch like this one ?
> 
> Yes looks good. Can we get this in soon?
> 

Please hold on, I would like to fully understand what's happening,
and test the patch :)



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-24 22:36                   ` Sridhar Samudrala
@ 2009-08-25 13:48                     ` Christoph Lameter
  2009-08-25 19:03                       ` David Stevens
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 13:48 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Eric Dumazet, Nivedita Singhvi, netdev, David S. Miller

On Mon, 24 Aug 2009, Sridhar Samudrala wrote:

> So are you hitting this case with your workload and does this account for all the
> packet losses you are seeing?

Yes.

> If we are dropping the packet and returing NET_XMIT_DROP, should
> we also increment qdisc drop stats (sch->qstats.drops)?

I think so but I am no expert. I was surprised to not even see counter
increments at that level. But I was content to fix up the higher level
tracking to have at least one counter that showed the packet loss.

> If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?

Yes.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 13:48                     ` Eric Dumazet
@ 2009-08-25 14:00                       ` Christoph Lameter
  2009-08-25 15:32                         ` Eric Dumazet
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 14:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Please hold on, I would like to fully understand what's happening,
> and test the patch :)

Ok. It would be good if the drops were also somehow noted by the UDP
subsystem (one should see something with netstat -su) and maybe even by the
socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 14:00                       ` Christoph Lameter
@ 2009-08-25 15:32                         ` Eric Dumazet
  2009-08-25 15:35                           ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 15:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> Please hold on, I would like to fully understand what's happening,
>> and test the patch :)
> 
> Ok. It would be good if the drops would also be somehow noted by the UDP
> subsystem (one should see something with netstat -su) and may be even the
> socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?

This /proc/net/udp column is for rx_drops currently and was recently added...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 15:32                         ` Eric Dumazet
@ 2009-08-25 15:35                           ` Christoph Lameter
  2009-08-25 15:58                             ` Eric Dumazet
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 15:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Christoph Lameter wrote:
> > On Tue, 25 Aug 2009, Eric Dumazet wrote:
> >
> >> Please hold on, I would like to fully understand what's happening,
> >> and test the patch :)
> >
> > Ok. It would be good if the drops would also be somehow noted by the UDP
> > subsystem (one should see something with netstat -su) and may be even the
> > socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
>
> This /proc/net/udp column is for rx_drops currently and was recently added...

So let's rename it to rx_drops and then add tx_drops?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 15:35                           ` Christoph Lameter
@ 2009-08-25 15:58                             ` Eric Dumazet
  2009-08-25 16:11                               ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 15:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> Christoph Lameter wrote:
>>> On Tue, 25 Aug 2009, Eric Dumazet wrote:
>>>
>>>> Please hold on, I would like to fully understand what's happening,
>>>> and test the patch :)
>>> Ok. It would be good if the drops would also be somehow noted by the UDP
>>> subsystem (one should see something with netstat -su) and may be even the
>>> socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
>> This /proc/net/udp column is for rx_drops currently and was recently added...
> 
> So lets rename it to rx_drops and then add tx_drops?
> 

It won't be very nice, because it would add yet another 32-bit counter to each socket
structure for an unlikely use, while rx_drops can happen whenever the application is slow.

Also, a tx drop might happen later, after send() returns, and go unnoticed.

Please read this old (and useful) thread, with Alexey's words...

http://oss.sgi.com/archives/netdev/2002-10/msg00612.html

http://oss.sgi.com/archives/netdev/2002-10/msg00617.html


So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 15:58                             ` Eric Dumazet
@ 2009-08-25 16:11                               ` Christoph Lameter
  2009-08-25 16:27                                 ` Eric Dumazet
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 16:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> It wont be very nice, because it'll add yet another 32bits counter in each socket
> structure, for a unlikely use. While rx_drops can happen if application is slow.

tx_drops happen if the application sends too fast.

TX drop tracking is important due to the braindamaged throttling logic
during send. If SO_SNDBUF is smaller than what fits in the TX ring, the
application will be throttled and no packet loss happens. If SO_SNDBUF is
set high, the TX ring will overflow and packets are dropped.

We need some way to diagnose TX drops per socket as long as we have
that mind-boggling issue. TX drops mean that one should reduce the size
of the send buffer in order to get better throttling, which reduces packet
loss.

> Also, tx_drops might be done later and not noticed.
>
> Please read this old (and usefull) thread, with Alexey words...
>
> http://oss.sgi.com/archives/netdev/2002-10/msg00612.html
>
> http://oss.sgi.com/archives/netdev/2002-10/msg00617.html
>
>
> So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)

I read this just yesterday. IP_RECVERR means that the application wants to
see details on each loss. We just want some counters that give us accurate
statistics to gauge where packet loss is occurring. Applications are
usually not interested in tracking the fate of each packet.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 16:11                               ` Christoph Lameter
@ 2009-08-25 16:27                                 ` Eric Dumazet
  2009-08-25 16:36                                   ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 16:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>> It wont be very nice, because it'll add yet another 32bits counter in each socket
>> structure, for a unlikely use. While rx_drops can happen if application is slow.
> 
> tx_drops happen if the application sends too fast.
> 
> TX drop tracking is important due to the braindamaged throttling logic
> during send. If SO_SNDBUF is less than what happens to fit in the TX ring then the
> application will be throttled and no packet loss happens. If SO_SNDBUF is
> set high then the TX ring will overflow and packets are dropped.
> 
> We need some way to diagnose TX drops per socket as long as we have
> that mind boggling issue. TX drops means that one should reduce the size
> of the sendbuffer in order to get better throttling which reduces packet
> loss.
> 
>> Also, tx_drops might be done later and not noticed.
>>
>> Please read this old (and usefull) thread, with Alexey words...
>>
>> http://oss.sgi.com/archives/netdev/2002-10/msg00612.html
>>
>> http://oss.sgi.com/archives/netdev/2002-10/msg00617.html
>>
>>
>> So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)
> 
> I read this just yesterday. IP_RECVERR means that the application wants to
> see details on each loss. We just want some counters that give us accurate
> statistics to gauge where packet loss is occurring. Applications are
> usually not interested in tracking the fate of each packet.

Yep, but IP_RECVERR also has the side effect of letting the kernel return an
-ENOBUFS error from send() under congestion, which was your initial point :)


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 16:27                                 ` Eric Dumazet
@ 2009-08-25 16:36                                   ` Christoph Lameter
  2009-08-25 16:48                                     ` Eric Dumazet
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 16:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> > I read this just yesterday. IP_RECVERR means that the application wants to
> > see details on each loss. We just want some counters that give us accurate
> > statistics to gauge where packet loss is occurring. Applications are
> > usually not interested in tracking the fate of each packet.
>
> Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
> in sending and congestion, which was your initial point :)

The initial point was that the SNMP counters are not updated if IP_RECVERR
is not set, which is clearly a bug; both your patch and mine address that.

Then Sridhar noted that there are other TX drop counters. The qdisc counters
are also not updated. I wish we maintained TX drop counters there as
well so that we can track down which NIC is dropping packets.

Then came the wishlist of UDP counters for TX drops and socket-based
tx_drop accounting, for tuning and for tracking down which app is sending
too fast .... ;-)

The apps could be third party apps. Just need to be able to troubleshoot
packet loss.






^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 16:36                                   ` Christoph Lameter
@ 2009-08-25 16:48                                     ` Eric Dumazet
  2009-08-25 17:01                                       ` Christoph Lameter
  2009-08-25 18:38                                       ` Sridhar Samudrala
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 16:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>>> I read this just yesterday. IP_RECVERR means that the application wants to
>>> see details on each loss. We just want some counters that give us accurate
>>> statistics to gauge where packet loss is occurring. Applications are
>>> usually not interested in tracking the fate of each packet.
>> Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
>> in sending and congestion, which was your initial point :)
> 
> The initial point was that the SNMP counters are not updated if IP_RECVERR
> is not set which is clearly a bug and your and my patch addresses that.

Technically speaking, the send() syscall fails with an error. The frame is not
sent, so there is no drop at all. It is like trying to send() from a bad user
buffer, or write() to a file that is too big...


> 
> Then Sridhar noted that there are other tx drop counters. qdisc counters
> are also not updated. Wish we would maintain tx drops counters there as
> well so that we can track down which NIC drops it.
> 
> Then came the wishlist of UDP counters for tx drops and socket based
> tx_drop accounting for tuning and tracking down which app is sending
> too fast .... ;-)
> 
> The apps could be third party apps. Just need to be able to troubleshoot
> packet loss.
> 

The question is: should we just allow send() to return an error (-ENOBUFS) regardless
of whether IP_RECVERR is set? I don't think it would be so bad after all.
Most apps probably don't care, or already handle the error.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 87f8419..a7e5f93 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1526,7 +1526,7 @@ int ip6_push_pending_frames(struct sock *sk)
 	err = ip6_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 16:48                                     ` Eric Dumazet
@ 2009-08-25 17:01                                       ` Christoph Lameter
  2009-08-25 17:08                                         ` Eric Dumazet
  2009-08-25 18:38                                       ` Sridhar Samudrala
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 17:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> > The initial point was that the SNMP counters are not updated if IP_RECVERR
> > is not set which is clearly a bug and your and my patch addresses that.
>
> Technically speaking, the send() syscall is in error. Frame is not sent, so
> there is no drop at all. Like trying to send() from a bad user buffer, or write()
> to a too big file...

The frame is submitted to the IP layer, which discards it. That is the
definition of an output discard.

> Question is : should we just allow send() to return an error (-ENOBUF) regardless
> of IP_RECVERR being set or not ? I dont think it would be so bad after all.
> Most apps probably dont care, or already handle the error.

Some applications will then start to fail, because so far you could send with
impunity without getting errors. AFAICT IP_RECVERR was added to preserve
that behavior. Your patch is changing basic send() semantics.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 17:01                                       ` Christoph Lameter
@ 2009-08-25 17:08                                         ` Eric Dumazet
  2009-08-25 17:44                                           ` Christoph Lameter
  2009-08-25 17:53                                           ` Christoph Lameter
  0 siblings, 2 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-08-25 17:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Eric Dumazet wrote:
> 
>>> The initial point was that the SNMP counters are not updated if IP_RECVERR
>>> is not set which is clearly a bug and your and my patch addresses that.
>> Technically speaking, the send() syscall is in error. Frame is not sent, so
>> there is no drop at all. Like trying to send() from a bad user buffer, or write()
>> to a too big file...
> 
> Frame is submitted to the IP layer which discards it. That is the
> definition of an output discard.
> 

The last patch accounts for this *error* AFAIK, or did I miss something?


>> Question is : should we just allow send() to return an error (-ENOBUF) regardless
>> of IP_RECVERR being set or not ? I dont think it would be so bad after all.
>> Most apps probably dont care, or already handle the error.
> 
> Some applications will then start to fail because so far you can send with
> impunity without getting errors. AFAICT IP_RECVERR was added to preserve
> that behavior. Your patch is changing basic send() semantics.

Sorry??? I guess your machines have plenty of LOWMEM available then, and kmalloc() never fails...

man P send

      ENOBUFS
              Insufficient resources were available in the system to perform the operation.

basic send() semantics are respected.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 17:08                                         ` Eric Dumazet
@ 2009-08-25 17:44                                           ` Christoph Lameter
  2009-08-25 17:53                                           ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 17:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller

On Tue, 25 Aug 2009, Eric Dumazet wrote:

> Christoph Lameter wrote:
> > On Tue, 25 Aug 2009, Eric Dumazet wrote:
> >
> >>> The initial point was that the SNMP counters are not updated if IP_RECVERR
> >>> is not set which is clearly a bug and your and my patch addresses that.
> >> Technically speaking, the send() syscall is in error. Frame is not sent, so
> >> there is no drop at all. Like trying to send() from a bad user buffer, or write()
> >> to a too big file...
> >
> > Frame is submitted to the IP layer which discards it. That is the
> > definition of an output discard.
> >
>
> Last patch accounts for this *error* AFAIK, or did I missed something ?

Right.

> >> Question is : should we just allow send() to return an error (-ENOBUF) regardless
> >> of IP_RECVERR being set or not ? I dont think it would be so bad after all.
> >> Most apps probably dont care, or already handle the error.
> >
> > Some applications will then start to fail because so far you can send with
> > impunity without getting errors. AFAICT IP_RECVERR was added to preserve
> > that behavior. Your patch is changing basic send() semantics.
>
> Sorry ???, I guess your machines have plenty available LOWMEM then, and kmalloc() never fail then...

Nope. Currently sendto() just drops the packet and returns success if the
TX ring is full. That can be done ad infinitum and at very high traffic
rates. We had one person here believing he could send 800k 300-byte
packets per second on a 1G wire.... ROTFL.

> basic send() semantics are respected.

basic send() semantics are changed by your patch. The 800k pps would no
longer work without sendto() returning errors.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 17:08                                         ` Eric Dumazet
  2009-08-25 17:44                                           ` Christoph Lameter
@ 2009-08-25 17:53                                           ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 17:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, Nivedita Singhvi, netdev, David S. Miller


The manpage for send() says:

ENOBUFS
              The output queue for a network interface was full.  This generally indicates that  the  interface  has  stopped
              sending, but may be caused by transient congestion.  (Normally, this does not occur in Linux.  Packets are just
              silently dropped when a device queue overflows.)


So ENOBUFS seems designed for exactly the role that you envision. We
just need to remove the parenthesized statement. It's still a change in
behavior, though.





^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 16:48                                     ` Eric Dumazet
  2009-08-25 17:01                                       ` Christoph Lameter
@ 2009-08-25 18:38                                       ` Sridhar Samudrala
  1 sibling, 0 replies; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-25 18:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, Nivedita Singhvi, netdev, David S. Miller

On Tue, 2009-08-25 at 18:48 +0200, Eric Dumazet wrote:
> Christoph Lameter wrote:
> > On Tue, 25 Aug 2009, Eric Dumazet wrote:
> > 
> >>> I read this just yesterday. IP_RECVERR means that the application wants to
> >>> see details on each loss. We just want some counters that give us accurate
> >>> statistics to gauge where packet loss is occurring. Applications are
> >>> usually not interested in tracking the fate of each packet.
> >> Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
> >> in sending and congestion, which was your initial point :)
> > 
> > The initial point was that the SNMP counters are not updated if IP_RECVERR
> > is not set which is clearly a bug and your and my patch addresses that.
> 
> Technically speaking, the send() syscall is in error. Frame is not sent, so
> there is no drop at all. Like trying to send() from a bad user buffer, or write()
> to a too big file...
> 
> 
> > 
> > Then Sridhar noted that there are other tx drop counters. qdisc counters
> > are also not updated. Wish we would maintain tx drops counters there as
> > well so that we can track down which NIC drops it.
> > 
> > Then came the wishlist of UDP counters for tx drops and socket based
> > tx_drop accounting for tuning and tracking down which app is sending
> > too fast .... ;-)
> > 
> > The apps could be third party apps. Just need to be able to troubleshoot
> > packet loss.
> > 
> 
> Question is : should we just allow send() to return an error (-ENOBUF) regardless
> of IP_RECVERR being set or not ? I dont think it would be so bad after all.
> Most apps probably dont care, or already handle the error.

This patch would also allow tracking drops at the UDP level via UDP_MIB_SNDBUFERRORS,
which is incremented in udp_sendmsg(). Right now this happens only if IP_RECVERR
is set on the socket.

Ideally, it would be good to track the drops at the qdisc, IP, and UDP levels if a
packet is passed all the way to dev_queue_xmit() and then dropped.

Thanks
Sridhar


> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7d08210..afae0cb 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
>  	err = ip_local_out(skb);
>  	if (err) {
>  		if (err > 0)
> -			err = inet->recverr ? net_xmit_errno(err) : 0;
> +			err = net_xmit_errno(err);
>  		if (err)
>  			goto error;
>  	}
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 87f8419..a7e5f93 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1526,7 +1526,7 @@ int ip6_push_pending_frames(struct sock *sk)
>  	err = ip6_local_out(skb);
>  	if (err) {
>  		if (err > 0)
> -			err = np->recverr ? net_xmit_errno(err) : 0;
> +			err = net_xmit_errno(err);
>  		if (err)
>  			goto error;
>  	}


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 13:48                     ` Christoph Lameter
@ 2009-08-25 19:03                       ` David Stevens
  2009-08-25 19:08                         ` David Miller
  2009-08-25 19:15                         ` Christoph Lameter
  0 siblings, 2 replies; 73+ messages in thread
From: David Stevens @ 2009-08-25 19:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Eric Dumazet, netdev, netdev-owner, niv, sri

Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24 AM:

> On Mon, 24 Aug 2009, Sridhar Samudrala wrote:

> > If we count these drops as qdisc drops, should we also count them as
> > IP OUTDISCARDS?
> 
> Yes.

Actually, no. (!)

IP_OUTDISCARDS should count the packets IP dropped, not
anything dropped at a lower layer (which, in general, it
is not aware of). If you count these in multiple layers,
then you don't really know who dropped it.

                                                +-DLS


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 19:03                       ` David Stevens
@ 2009-08-25 19:08                         ` David Miller
  2009-08-25 19:15                         ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: David Miller @ 2009-08-25 19:08 UTC (permalink / raw)
  To: dlstevens; +Cc: cl, eric.dumazet, netdev, netdev-owner, niv, sri

From: David Stevens <dlstevens@us.ibm.com>
Date: Tue, 25 Aug 2009 12:03:58 -0700

> IP_OUTDISCARDS should count the packets IP dropped, not
> anything dropped at a lower layer (which, in general, it
> is not aware of). If you count these in multiple layers,
> then you don't really know who dropped it.

Right.

We are in danger of going from one extreme to the other.
Previously we lacked some drop detection capabilities
but now we've filled most of these holes and ON TOP of
all of that we have Neil's SKB drop tracer.

Let's not get carried away over-accounting this stuff.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 19:03                       ` David Stevens
  2009-08-25 19:08                         ` David Miller
@ 2009-08-25 19:15                         ` Christoph Lameter
  2009-08-25 19:56                           ` Joe Perches
  2009-08-25 22:35                           ` Sridhar Samudrala
  1 sibling, 2 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-25 19:15 UTC (permalink / raw)
  To: David Stevens
  Cc: David S. Miller, Eric Dumazet, netdev, netdev-owner, niv, sri

On Tue, 25 Aug 2009, David Stevens wrote:

> Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24
> AM:
>
> > On Mon, 24 Aug 2009, Sridhar Samudrala wrote:
>
> > > If we count these drops as qdisc drops, should we also count them as
> > > IP OUTDISCARDS?
> >
> > Yes.
>
> Actually, no. (!)
>
> IP_OUTDISCARDS should count the packets IP dropped, not
> anything dropped at a lower layer (which, in general, it
> is not aware of). If you count these in multiple layers,
> then you don't really know who dropped it.

You are right. I missed that IP OUTDISCARDS reference. The drops need to be
accounted for at the qdisc level, though.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 19:15                         ` Christoph Lameter
@ 2009-08-25 19:56                           ` Joe Perches
  2009-08-25 22:35                           ` Sridhar Samudrala
  1 sibling, 0 replies; 73+ messages in thread
From: Joe Perches @ 2009-08-25 19:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev,
	netdev-owner, niv, sri

On Tue, 2009-08-25 at 15:15 -0400, Christoph Lameter wrote:
> On Tue, 25 Aug 2009, David Stevens wrote:
> > Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24
> > AM:
> > > On Mon, 24 Aug 2009, Sridhar Samudrala wrote:
> > > > If we count these drops as qdisc drops, should we also count them as
> > > > IP OUTDISCARDS?
> > > Yes.
> > Actually, no. (!)
> > IP_OUTDISCARDS should count the packets IP dropped, not
> > anything dropped at a lower layer (which, in general, it
> > is not aware of). If you count these in multiple layers,
> > then you don't really know who dropped it.
> They need to be accounted at the qdisc level though.

It's probably useful to be able to know when packets
and payloads are dropped, but it may not be necessary.
It's probably not fundamental.

Chariot, LANforge and apps like them might care, but most
other apps might not care at all.

Maybe these counters should be gated behind a CONFIG option.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 19:15                         ` Christoph Lameter
  2009-08-25 19:56                           ` Joe Perches
@ 2009-08-25 22:35                           ` Sridhar Samudrala
  2009-08-26 14:08                             ` Christoph Lameter
  2009-08-26 16:29                             ` Christoph Lameter
  1 sibling, 2 replies; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-25 22:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev,
	netdev-owner, niv, sri

On Tue, 2009-08-25 at 15:15 -0400, Christoph Lameter wrote:
> On Tue, 25 Aug 2009, David Stevens wrote:
> 
> > Christoph Lameter <cl@linux-foundation.org> wrote on 08/25/2009 06:48:24
> > AM:
> >
> > > On Mon, 24 Aug 2009, Sridhar Samudrala wrote:
> >
> > > > If we count these drops as qdisc drops, should we also count them as
> > IP OUTDISCARDS?
> > >
> > > Yes.
> >
> > Actually, no. (!)
> >
> > IP_OUTDISCARDS should count the packets IP dropped, not
> > anything dropped at a lower layer (which, in general, it
> > is not aware of). If you count these in multiple layers,
> > then you don't really know who dropped it.
> 
> You are right. I missed that IP OUTDISCARDS reference. They need to be
> accounted for at the qdisc level, though.

Yes. Now that we agree that drops at the dev_queue_xmit level should be counted
under qdisc stats, the following patch should address one of the three places where
NET_XMIT_DROP is returned but the qdisc drop stat is not incremented.
The other two places are in the IPsec output functions esp_output and esp6_output.
I am not sure where those drops should be accounted.

Could you check whether the UDP packet losses you are seeing are accounted for in
qdisc drops with this patch? I am not completely positive about this, as this
case happens only if the qdisc is deactivated.

diff --git a/net/core/dev.c b/net/core/dev.c
index 6a94475..8b6a075 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1864,8 +1864,7 @@ gso:
 		spin_lock(root_lock);
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-			kfree_skb(skb);
-			rc = NET_XMIT_DROP;
+			rc = qdisc_drop(skb, q);
 		} else {
 			rc = qdisc_enqueue_root(skb, q);
 			qdisc_run(q);

Thanks
Sridhar


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 22:35                           ` Sridhar Samudrala
@ 2009-08-26 14:08                             ` Christoph Lameter
  2009-08-26 14:22                               ` Eric Dumazet
  2009-08-26 16:29                             ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-26 14:08 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev,
	netdev-owner, niv, sri

I also see no UDP stats for packet drops. If Eric's fix gets in, then
ip_push_pending_frames() will give us an error code on drop.
Incrementing SNDBUFERRORS would then require this simple patch.


UDP: Account for TX drops

The UDP layer currently does not increment error counters when packets are
dropped. Use SNDBUFERRORS to indicate packet drops on send.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 net/ipv4/udp.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6/net/ipv4/udp.c
===================================================================
--- linux-2.6.orig/net/ipv4/udp.c	2009-08-26 13:21:30.000000000 +0000
+++ linux-2.6/net/ipv4/udp.c	2009-08-26 13:46:35.000000000 +0000
@@ -559,6 +559,13 @@ static int udp_push_pending_frames(struc

 send:
 	err = ip_push_pending_frames(sk);
+
+	if (err)
+		/*
+		 * Packet was dropped.
+		 */
+		UDP_INC_STATS_USER(sock_net(sk),
+			UDP_MIB_SNDBUFERRORS, is_udplite);
 out:
 	up->len = 0;
 	up->pending = 0;

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 14:08                             ` Christoph Lameter
@ 2009-08-26 14:22                               ` Eric Dumazet
  2009-08-26 15:27                                 ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-26 14:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, David Stevens, David S. Miller, netdev, niv, sri

Christoph Lameter wrote:
> I also see no UDP stats for packet drops. If Erics fix gets in then
> ip_push_pending_frames() will give us an error code on drop.
> Incrementing SNDBUFERRORS would require this simple patch.
> 

I think it's already done in udp_sendmsg()

Code starting at line 765 in net/ipv4/udp.c

        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things).  We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                UDP_MIB_SNDBUFERRORS, is_udplite);
        }


> 
> UDP: Account for TX drops
> 
> UDP layer is currently not incrementing error counters when packets are
> dropped. Use the SNDBUFERRORS to indicate packet drops on send.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> ---
>  net/ipv4/udp.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: linux-2.6/net/ipv4/udp.c
> ===================================================================
> --- linux-2.6.orig/net/ipv4/udp.c	2009-08-26 13:21:30.000000000 +0000
> +++ linux-2.6/net/ipv4/udp.c	2009-08-26 13:46:35.000000000 +0000
> @@ -559,6 +559,13 @@ static int udp_push_pending_frames(struc
> 
>  send:
>  	err = ip_push_pending_frames(sk);
> +
> +	if (err)
> +		/*
> +		 * Packet was dropped.
> +		 */
> +		UDP_INC_STATS_USER(sock_net(sk),
> +			UDP_MIB_SNDBUFERRORS, is_udplite);
>  out:
>  	up->len = 0;
>  	up->pending = 0;


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 14:22                               ` Eric Dumazet
@ 2009-08-26 15:27                                 ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-26 15:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Sridhar Samudrala, David Stevens, David S. Miller, netdev, niv, sri

On Wed, 26 Aug 2009, Eric Dumazet wrote:

> I think it's already done in udp_sendmsg()
> Code starting at line 765 in net/ipv4/udp.c
>
>         /*
>          * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
>          * ENOBUFS might not be good (it's not tunable per se), but otherwise
>          * we don't have a good statistic (IpOutDiscards but it can be too many
>          * things).  We could add another new stat but at least for now that
>          * seems like overkill.
>          */
>         if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
>                 UDP_INC_STATS_USER(sock_net(sk),
>                                 UDP_MIB_SNDBUFERRORS, is_udplite);
>         }
>

Right. That would mean the fix to ip_push_pending_frames() would also fix
UDP tx drop accounting.

ENOBUFS is then returned for two cases.

1. SNDBUF overflow
2. NIC TX overflow

Hope that is not confusing.




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-25 22:35                           ` Sridhar Samudrala
  2009-08-26 14:08                             ` Christoph Lameter
@ 2009-08-26 16:29                             ` Christoph Lameter
  2009-08-26 17:50                               ` Sridhar Samudrala
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-26 16:29 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev,
	netdev-owner, niv, sri

On Tue, 25 Aug 2009, Sridhar Samudrala wrote:

> Could you check if the UDP packet losses you are seeing are accounted for in
> qdisc drops with this patch. But i am not completely positive on this as this
> case happens only if qdisc is deactivated.

This does not work. qdisc drops are still not reported. They are reported
for IP and UDP.

Test tool crashes on first TX overrun:

clameter@rd-strategy3-deb64:~$ ./mcast -n1 -r400000
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 0 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.123
Sender: Sending 400000 msgs/ch/sec on 1 channels. Probe interval=0.001-1
sec.
sendto: No buffer space available
Socket Send error

netstat reports exactly one packet loss:


clameter@rd-strategy3-deb64:~$ netstat -su
IcmpMsg:
    InType3: 1
    OutType3: 1
Udp:
    298 packets received
    0 packets to unknown port received.
    0 packet receive errors
    7232136 packets sent
    SndbufErrors: 1

root@rd-strategy3-deb64:/home/clameter#tc -s qdisc show
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1
1 1 1 1
 Sent 6208 bytes 64 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

SNMP reports one drop:

root@rd-strategy3-deb64:/home/clameter#cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors
ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests
OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails
FragOKs FragFails FragCreates
Ip: 2 64 1114 0 0 0 0 0 1114 7232754 1 0 0 0 0 0 0 0 0
Icmp: InMsgs InErrors InDestUnreachs InTimeExcds InParmProbs InSrcQuenchs
InRedirects InEchos InEchoReps InTimestamps InTimestampReps InAddrMasks
InAddrMaskReps OutMsgs OutErrors OutDestUnreachs OutTimeExcds OutParmProbs
OutSrcQuenchs OutRedirects OutEchos OutEchoReps OutTimestamps
OutTimestampReps OutAddrMasks OutAddrMaskReps
Icmp: 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
IcmpMsg: InType3 OutType3
IcmpMsg: 1 1
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens
AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs
OutRsts
Tcp: 1 200 120000 -1 26 4 0 0 2 774 595 0 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 308 0 0 7232146 0 1
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors
SndbufErrors
UdpLite: 0 0 0 0 0 0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 16:29                             ` Christoph Lameter
@ 2009-08-26 17:50                               ` Sridhar Samudrala
  2009-08-26 19:09                                 ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-26 17:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv

On Wed, 2009-08-26 at 12:29 -0400, Christoph Lameter wrote:
> On Tue, 25 Aug 2009, Sridhar Samudrala wrote:
> 
> > Could you check if the UDP packet losses you are seeing are accounted for in
> > qdisc drops with this patch. But i am not completely positive on this as this
> > case happens only if qdisc is deactivated.
> 
> This does not work. qdisc drops are still not reported.

OK. So the drops are not happening in dev_queue_xmit().

>  They are reported for IP and UDP.
Not clear what you meant by this.

> Test tool crashes on first TX overrun:
> 
> clameter@rd-strategy3-deb64:~$ ./mcast -n1 -r400000
> Receiver: Listening to control channel 239.0.192.1
> Receiver: Subscribing to 0 MC addresses 239.0.192-254.2-254 offset 0
> origin 10.2.36.123
> Sender: Sending 400000 msgs/ch/sec on 1 channels. Probe interval=0.001-1
> sec.
> sendto: No buffer space available
> Socket Send error
> 
> netstat reports exactly one packet loss:
> 
> 
> clameter@rd-strategy3-deb64:~$ netstat -su
> IcmpMsg:
>     InType3: 1
>     OutType3: 1
> Udp:
>     298 packets received
>     0 packets to unknown port received.
>     0 packet receive errors
>     7232136 packets sent
>     SndbufErrors: 1
> 
> root@rd-strategy3-deb64:/home/clameter#tc -s qdisc show
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1
> 1 1 1 1
>  Sent 6208 bytes 64 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0

Even the Sent count seems to be too low. Are you looking at the right
device?

So based on the current analysis, the packets are getting dropped after
the call to ip_local_out() in ip_push_pending_frames(). ip_local_out()
is failing with NET_XMIT_DROP. But we are not sure where they are
getting dropped. Is that right?

I think we need to figure out where they are getting dropped and then
decide on the appropriate counter to be incremented.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 17:50                               ` Sridhar Samudrala
@ 2009-08-26 19:09                                 ` Christoph Lameter
  2009-08-26 22:11                                   ` Sridhar Samudrala
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-26 19:09 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv

On Wed, 26 Aug 2009, Sridhar Samudrala wrote:

> >  They are reported for IP and UDP.
> Not clear what you meant by this.

The SNMP (IP) and UDP statistics show the loss; the qdisc level does not show the
loss.

> > root@rd-strategy3-deb64:/home/clameter#tc -s qdisc show
> > qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1
> > 1 1 1 1
> >  Sent 6208 bytes 64 pkt (dropped 0, overlimits 0 requeues 0)
> >  rate 0bit 0pps backlog 0b 0p requeues 0
>
> Even the Sent count seems to be too low. Are you looking at the right
> device?

I would think that tc displays all queues? It says eth0, and eth0 is the
device that we sent the data out on.


> So based on the current analysis, the packets are getting dropped after
> the call to ip_local_out() in ip_push_pending_frames(). ip_local_out()
> is failing with NET_XMIT_DROP. But we are not sure where they are
> getting dropped. Is that right?

ip_local_out is returning ENOBUFS. Something at the qdisc layer is
dropping the packet and not incrementing counters.

> I think we need to figure out where they are getting dropped and then
> decide on the appropriate counter to be incremented.

Right. Where in the qdisc layer do drops occur?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 19:09                                 ` Christoph Lameter
@ 2009-08-26 22:11                                   ` Sridhar Samudrala
  2009-08-27 15:40                                     ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-08-26 22:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv

On Wed, 2009-08-26 at 15:09 -0400, Christoph Lameter wrote:
> On Wed, 26 Aug 2009, Sridhar Samudrala wrote:
> 
> > >  They are reported for IP and UDP.
> > Not clear what you meant by this.
> 
> The SNMP and UDP statistics show the loss. qdisc level does not show the
> loss.
> > > root@rd-strategy3-deb64:/home/clameter#tc -s qdisc show
> > > qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1
> > > 1 1 1 1
> > >  Sent 6208 bytes 64 pkt (dropped 0, overlimits 0 requeues 0)
> > >  rate 0bit 0pps backlog 0b 0p requeues 0
> >
> > Even the Sent count seems to be too low. Are you looking at the right
> > device?
> 
> I would think that tc displays all queues? It says eth0 and eth0 is the
> device that we sent the data out on.


> 
> > So based on the current analysis, the packets are getting dropped after
> > the call to ip_local_out() in ip_push_pending_frames(). ip_local_out()
> > is failing with NET_XMIT_DROP. But we are not sure where they are
> > getting dropped. Is that right?
> 
> ip_local_out is returning ENOBUFS. Something at the qdisc layer is
> dropping the packet and not incrementing counters.

Is the ENOBUFS return with your/Eric's patch applied? I thought you
were seeing NET_XMIT_DROP without any patches.

> 
> > I think we need to figure out where they are getting dropped and then
> > decide on the appropriate counter to be incremented.
> 
> Right. Where in the qdisc layer do drops occur?

The normal path where packets are dropped when the tx qlen is exceeded is
  pfifo_fast_enqueue() -> qdisc_drop()
In this path, drops are counted.
The other place is in dev_queue_xmit(), but you are not hitting that case either.

So it looks like there is another place where they are getting dropped.

Thanks
Sridhar






^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-26 22:11                                   ` Sridhar Samudrala
@ 2009-08-27 15:40                                     ` Christoph Lameter
  2009-08-27 20:23                                       ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-27 15:40 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv

On Wed, 26 Aug 2009, Sridhar Samudrala wrote:

> > ip_local_out is returning ENOBUFS. Something at the qdisc layer is
> > dropping the packet and not incrementing counters.
>
> Is the ENOBUFS return with your/Eric's patch applied? I thought you
> were seeing NET_XMIT_DROP without any patches.

Both Eric's latest patch and your patch were applied.

> > > I think we need to figure out where they are getting dropped and then
> > > decide on the appropriate counter to be incremented.
> >
> > Right. Where in the qdisc layer do drops occur?
>
> The normal path where the packets are dropped when the tx qlen is exceeded is
>   pfifo_fast_enqueue() -> qdisc_drop()
> In this path, drops are counted.
> The other place is in dev_queue_xmit(), but you are not hitting that case too.
>
> So it looks like there is another place where they are getting dropped.

Hmmm.. I need to find more time to dig into this.

Anyway, it seems that Eric's latest patch does many good things for
packet loss accounting.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-27 15:40                                     ` Christoph Lameter
@ 2009-08-27 20:23                                       ` Christoph Lameter
  2009-08-28 13:53                                         ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-27 20:23 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv


ip_local_out returns NET_XMIT_DROP but the qdisc increment does not
trigger...

[  219.098552] ip_local_out failed with 1
[  219.098798] ip_local_out failed with 1
[  219.099091] ip_local_out failed with 1
[  219.099158] ip_local_out failed with 1
[  219.099399] ip_local_out failed with 1
[  219.099466] ip_local_out failed with 1
[  219.099530] ip_local_out failed with 1
[  219.099688] ip_local_out failed with 1
[  219.099751] ip_local_out failed with 1
[  219.099818] ip_local_out failed with 1

---
 net/core/dev.c       |    2 ++
 net/ipv4/ip_output.c |    2 ++
 2 files changed, 4 insertions(+)

Index: linux-2.6.31-rc7/net/core/dev.c
===================================================================
--- linux-2.6.31-rc7.orig/net/core/dev.c	2009-08-27 19:46:44.000000000 +0000
+++ linux-2.6.31-rc7/net/core/dev.c	2009-08-27 19:51:54.000000000 +0000
@@ -1865,6 +1865,8 @@ gso:

 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 			rc = qdisc_drop(skb, q);
+			if (net_ratelimit())
+				printk(KERN_CRIT "Qdisc not active. NIC overrun\n");
 		} else {
 			rc = qdisc_enqueue_root(skb, q);
 			qdisc_run(q);
Index: linux-2.6.31-rc7/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.31-rc7.orig/net/ipv4/ip_output.c	2009-08-27 19:48:17.000000000 +0000
+++ linux-2.6.31-rc7/net/ipv4/ip_output.c	2009-08-27 19:51:30.000000000 +0000
@@ -1301,6 +1301,8 @@ int ip_push_pending_frames(struct sock *
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
+		if (net_ratelimit())
+			printk(KERN_CRIT "ip_local_out failed with %d\n", err);
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-27 20:23                                       ` Christoph Lameter
@ 2009-08-28 13:53                                         ` Christoph Lameter
  2009-08-28 15:07                                           ` Eric Dumazet
  2009-08-28 19:24                                           ` David Miller
  0 siblings, 2 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-28 13:53 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Stevens, David S. Miller, Eric Dumazet, netdev, niv


The qdisc drop counter is incremented in pfifo_fast. So Sridhar's patch is
not necessary.

It seems, though, that the qdisc drop count does not flow into the tx_dropped
counter for the interface. Incrementing the tx_dropped count in the
netdev_queue associated with the outbound qdisc also had no effect (see
the following patch).

Plus, I only see one queue for eth0 with "tc -s qdisc show". I think what
I see there is the queue for receiving packets. tc uses this ugly
netlink interface; could there be a bug in tc or in the netlink interface?
Or is there some other trick to display queue statistics for outgoing
packets?

WTH is going on here? Was no one ever interested in getting outbound packet
loss accounting right?


Index: linux-2.6.31-rc7/include/net/sch_generic.h
===================================================================
--- linux-2.6.31-rc7.orig/include/net/sch_generic.h	2009-08-27 21:20:03.000000000 +0000
+++ linux-2.6.31-rc7/include/net/sch_generic.h	2009-08-27 21:26:33.000000000 +0000
@@ -509,6 +509,9 @@ static inline int qdisc_drop(struct sk_b
 	kfree_skb(skb);
 	sch->qstats.drops++;

+	/* device queue statistics */
+	sch->dev_queue->tx_dropped++;
+
 	return NET_XMIT_DROP;
 }





^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 13:53                                         ` Christoph Lameter
@ 2009-08-28 15:07                                           ` Eric Dumazet
  2009-08-28 16:15                                             ` Christoph Lameter
  2009-08-28 19:24                                           ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-28 15:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, David Stevens, David S. Miller, netdev, niv

Christoph Lameter wrote:
> The qdisc drop counter is incremented in pfifo_fast. So Sridhar's patch is
> not necessary.
> 
> Seems though that the qdisc drop count does not flow into the tx_dropped
> counter for the interface. Incrementing the tx_dropped count in the
> netdev_queue associated with the outbound qdisc also had no effect (see
> the following patch).
> 
> Plus I only see one queue for eth0 with "tc -s qdisc show". I think that
> what I see there is the queue for receiving packets. tc uses this ugly
> netlink interface. Could be a bug in there or in the netlink interface?
> Or is there some other trick to display queue statistics for outgoing
> packets?

"tc -s qdisc show" only displays queue info for tx packets.

> 
> WTH is going on here? Noone was ever interested in making outbound packet
> loss account right?
> 

I have no idea what your problem could be, Christoph.

Here, on unpatched git linux-2.6 kernel, default qdisc, and an udp tx flood I get :

# tc -s -d qdisc show dev eth3
qdisc pfifo_fast 0: root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 18025794122 bytes 17299241 pkt (dropped 264892, overlimits 0 requeues 68282)
 rate 0bit 0pps backlog 20840b 20p requeues 68282



> 
> Index: linux-2.6.31-rc7/include/net/sch_generic.h
> ===================================================================
> --- linux-2.6.31-rc7.orig/include/net/sch_generic.h	2009-08-27 21:20:03.000000000 +0000
> +++ linux-2.6.31-rc7/include/net/sch_generic.h	2009-08-27 21:26:33.000000000 +0000
> @@ -509,6 +509,9 @@ static inline int qdisc_drop(struct sk_b
>  	kfree_skb(skb);
>  	sch->qstats.drops++;
> 
> +	/* device queue statistics */
> +	sch->dev_queue->tx_dropped++;
> +
>  	return NET_XMIT_DROP;
>  }

Locking problem here: tx_dropped can be changed by another CPU.


As David Stevens pointed out, the device was never called at all when your packet(s) was/were lost.
Why should we account a nonexistent drop at the device level?

When a process wants a new memory page and hits its own limit, do you want to increment a system-global
counter saying 'memory allocation failed'?


So in my case :

$ ifconfig eth3
eth3      Link encap:Ethernet  HWaddr 00:1E:0B:92:78:51
          inet addr:192.168.0.2  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1188 errors:0 dropped:0 overruns:0 frame:0
          TX packets:63774907 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:633918 (619.0 KiB)  TX bytes:105287564 (100.4 MiB)
          Interrupt:16

And yes, dropped:0 is OK here, since the packets were dropped at the qdisc layer.


The only change you want, then, is to account for the UDP drop (SndbufErrors).


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 15:07                                           ` Eric Dumazet
@ 2009-08-28 16:15                                             ` Christoph Lameter
  2009-08-28 17:26                                               ` [PATCH net-next-2.6] ip: Report qdisc packet drops Eric Dumazet
  2009-08-28 19:26                                               ` UDP multicast packet loss not reported if TX ring overrun? David Miller
  0 siblings, 2 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-28 16:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Sridhar Samudrala, David Stevens, David S. Miller, netdev, niv

On Fri, 28 Aug 2009, Eric Dumazet wrote:

> "tc -s qdisc show" only displays queue info for tx packets.

Duh. The packet counters and bytes are way out of whack and do not reflect
what was actually sent. This must be some other qdisc.

> >  	sch->qstats.drops++;
> >
> > +	/* device queue statistics */
> > +	sch->dev_queue->tx_dropped++;
> > +
> >  	return NET_XMIT_DROP;
> >  }
>
> locking problem here, tx_dropped can be changed by another cpu.

Who cares? It was just for debugging.

> As David Stevens pointed out, device was not ever called at all when your packet(s) was/were lost.
> Why should we account a non existent drop at device level ?

Because you need drop statistics on a device to figure out when you may
want to increase its TX buffers. If a packet was dropped for lack of
TX buffers, then we need to know.

> When a process wants a new memory page and hits its own limit, do you want to increment a system global
> counter saying 'memory allocation failed' ?

When a process allocates and locks all of memory, then you get an OOM.

> Only change you want is eventually to account for the UDP drop (SndbufErrors).

That is only a counter at the UDP layer. It does not let you identify
which NIC was involved or which application caused the drop.

But its already a big improvement to see TX drops at all. Could we get
your latest patch merged soon?




^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-08-28 16:15                                             ` Christoph Lameter
@ 2009-08-28 17:26                                               ` Eric Dumazet
  2009-08-29  6:38                                                 ` David Miller
  2009-08-28 19:26                                               ` UDP multicast packet loss not reported if TX ring overrun? David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-28 17:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sridhar Samudrala, David Stevens, David S. Miller, netdev, niv,
	Michael Kerrisk

Christoph Lameter wrote:
> On Fri, 28 Aug 2009, Eric Dumazet wrote:
>> Only change you want is eventually to account for the UDP drop (SndbufErrors).
> 
> That is only a counter at the UDP layer. That one does not allow you to
> identify which NIC it was nor which application caused it.

We don't have per-application SNMP counters, so the application is responsible
for proper syscall return checking / accounting if necessary.

At the NIC level, just forget it; it makes no sense.

> 
> But its already a big improvement to see TX drops at all. Could we get
> your latest patch merged soon?

I officially submit it, but David has to take it or reject it :)

Thanks

[PATCH] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at the qdisc level were not
accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
reported to the user and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing.
In case of tx drops at the qdisc level, no error packet will be generated.
It seems unnecessary to hide the qdisc drops from sockets without IP_RECVERR
enabled (as most sockets probably are).

By removing the check of IP_RECVERR enabled sockets in ip_push_pending_frames()/
raw_send_hdrinc() / ip6_push_pending_frames() / rawv6_send_hdrinc(),
we can properly update IPSTATS_MIB_OUTDISCARDS, and in case of UDP, update
UDP_MIB_SNDBUFERRORS SNMP counters.

Application send() syscalls, instead of returning an OK status (thus lying),
will return an -ENOBUFS error.

Note: the send() manual page explicitly says of the -ENOBUFS error:

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This was not true for IP_RECVERR-enabled sockets on pre-2.6.32 kernels,
and starting from Linux 2.6.32 the last part won't be true at all.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
 net/ipv4/ip_output.c  |    2 +-
 net/ipv4/raw.c        |    2 +-
 net/ipv6/ip6_output.c |    2 +-
 net/ipv6/raw.c        |    2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..80ff607 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -375,7 +375,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6ad5aad..537e8cf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1511,7 +1511,7 @@ int ip6_push_pending_frames(struct sock *sk)
 	err = ip6_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..1f7ee61 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -642,7 +642,7 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 13:53                                         ` Christoph Lameter
  2009-08-28 15:07                                           ` Eric Dumazet
@ 2009-08-28 19:24                                           ` David Miller
  2009-08-28 19:53                                             ` Christoph Lameter
  2009-08-30  0:21                                             ` Mark Smith
  1 sibling, 2 replies; 73+ messages in thread
From: David Miller @ 2009-08-28 19:24 UTC (permalink / raw)
  To: cl; +Cc: sri, dlstevens, eric.dumazet, netdev, niv

From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 28 Aug 2009 09:53:40 -0400 (EDT)

> Seems though that the qdisc drop count does not flow into the tx_dropped
> counter for the interface.

And it should not.

The qdisc drops the packet due to flow control, not the hardware
device.

Device drops are for things like transmission errors on the wire.

If you start incrementing tx_dropped here, people won't be able
to tell they have a deteriorating cable or bad switch or similar.


* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 16:15                                             ` Christoph Lameter
  2009-08-28 17:26                                               ` [PATCH net-next-2.6] ip: Report qdisc packet drops Eric Dumazet
@ 2009-08-28 19:26                                               ` David Miller
  2009-08-28 20:00                                                 ` Christoph Lameter
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2009-08-28 19:26 UTC (permalink / raw)
  To: cl; +Cc: eric.dumazet, sri, dlstevens, netdev, niv

From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 28 Aug 2009 12:15:39 -0400 (EDT)

> Because you need drop statistics on a device to figure out when you may
> want to increase the TX buffers for a device. If a packet was dropped
> because of a lack of TX buffers then we need to know.

Christoph, the qdisc layer is there and is where the drops occur, and
you therefore have to be aware of it.

I'm not accepting changes like the patch you propose here, it's not
the correct thing to do.

The device never saw the packet, it never dropped it.

The qdisc saw it, the qdisc dropped it, and therefore that's where we
account for it.


* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 19:24                                           ` David Miller
@ 2009-08-28 19:53                                             ` Christoph Lameter
  2009-08-28 20:03                                               ` David Miller
  2009-08-30  0:21                                             ` Mark Smith
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-28 19:53 UTC (permalink / raw)
  To: David Miller; +Cc: sri, dlstevens, eric.dumazet, netdev, niv

On Fri, 28 Aug 2009, David Miller wrote:

> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Fri, 28 Aug 2009 09:53:40 -0400 (EDT)
>
> > Seems though that the qdisc drop count does not flow into the tx_dropped
> > counter for the interface.
>
> And it should not.
>
> The qdisc drops the packet due to flow control, not the hardware
> device.
>
> Device drops are for things like transmission errors on the wire.
>
> If you start incrementing tx_dropped here, people won't be able
> to tell they have a deteriorating cable or bad switch or similar.

?? The patch does not do that. That was extra stuff not covered by the
patch.



* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 19:26                                               ` UDP multicast packet loss not reported if TX ring overrun? David Miller
@ 2009-08-28 20:00                                                 ` Christoph Lameter
  2009-08-28 20:04                                                   ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-08-28 20:00 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, sri, dlstevens, netdev, niv

On Fri, 28 Aug 2009, David Miller wrote:

> I'm not accepting changes like the patch you propose here, it's not
> the correct thing to do.
>
> The device never saw the packet, it never dropped it.
>
> The qdisc saw it, the qdisc dropped it, and therefore that's where we
> account for it.

UDP send currently drops packets without any error indication and without
incrementing any counters.

That occurs only if the socket option IP_RECVERR is not set.

If you set IP_RECVERR then statistics are kept about dropped packets and
they are displayed as drops at the IP and UDP layer.

Are you saying that this behavior is okay? Is it desired behavior that a
socket option changes how global network counters are handled?
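For reference, IP_RECVERR is set with a plain setsockopt() call. The sketch below is illustrative rather than taken from this thread (the helper names are invented), and it assumes the proposed behavior in which a qdisc-level drop surfaces from send() as ENOBUFS on an IP_RECVERR socket:

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket with IP_RECVERR enabled, so that locally detected
 * errors (including, under the proposed behavior, qdisc-level drops
 * reported as ENOBUFS) are delivered to the application. */
static int udp_socket_with_recverr(void)
{
	int on = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Send one datagram, treating ENOBUFS as "dropped locally": the caller
 * can back off and retry instead of aborting.  Returns bytes sent,
 * 0 on a local drop, -1 on other errors. */
static ssize_t send_once(int fd, const struct sockaddr_in *dst,
			 const void *buf, size_t len)
{
	ssize_t n = sendto(fd, buf, len, 0,
			   (const struct sockaddr *)dst, sizeof(*dst));

	if (n < 0 && errno == ENOBUFS)
		return 0;	/* local qdisc/device queue overflow */
	return n;
}
```

An application using this pattern can treat ENOBUFS as a back-off signal rather than a fatal error.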




* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 19:53                                             ` Christoph Lameter
@ 2009-08-28 20:03                                               ` David Miller
  2009-08-28 20:09                                                 ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2009-08-28 20:03 UTC (permalink / raw)
  To: cl; +Cc: sri, dlstevens, eric.dumazet, netdev, niv

From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 28 Aug 2009 15:53:42 -0400 (EDT)

> On Fri, 28 Aug 2009, David Miller wrote:
> 
>> If you start incrementing tx_dropped here, people won't be able
>> to tell they have a deteriorating cable or bad switch or similar.
> 
> ?? The patch does not do that. That was extra stuff not covered by the
> patch.

Yes you did, you put a device tx_dropped counter bump into
qdisc_drop().


* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 20:00                                                 ` Christoph Lameter
@ 2009-08-28 20:04                                                   ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2009-08-28 20:04 UTC (permalink / raw)
  To: cl; +Cc: eric.dumazet, sri, dlstevens, netdev, niv

From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 28 Aug 2009 16:00:24 -0400 (EDT)

> Are you saying that this behavior is okay? It is desired behavior that a
> socket option changes the way how global network counters are handled?

I'm saying that bumping the per-device tx_dropped counter in
the qdisc layer is wrong, nothing more.


* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 20:03                                               ` David Miller
@ 2009-08-28 20:09                                                 ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-08-28 20:09 UTC (permalink / raw)
  To: David Miller; +Cc: sri, dlstevens, eric.dumazet, netdev, niv

On Fri, 28 Aug 2009, David Miller wrote:

> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Fri, 28 Aug 2009 15:53:42 -0400 (EDT)
>
> > On Fri, 28 Aug 2009, David Miller wrote:
> >
> >> If you start incrementing tx_dropped here, people won't be able
> >> to tell they have a deteriorating cable or bad switch or similar.
> >
> > ?? The patch does not do that. That was extra stuff not covered by the
> > patch.
>
> Yes you did, you put a device tx_dropped counter bump into
> qdisc_drop().

Yes, I did that to test whether I could propagate the counter to the
device struct. But that did not work, and it showed that the qdisc
structure displayed by tc is different from the qdisc in use for the
device. There must be some other bug that causes data from a different
qdisc to be displayed. Maybe it is device specific.

The patch was definitely never intended to be submitted for inclusion,
just to show the nature of the bug.



* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-08-28 17:26                                               ` [PATCH net-next-2.6] ip: Report qdisc packet drops Eric Dumazet
@ 2009-08-29  6:38                                                 ` David Miller
  2009-08-31 12:09                                                   ` Eric Dumazet
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2009-08-29  6:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 28 Aug 2009 19:26:04 +0200

> [PATCH] ip: Report qdisc packet drops
> 
> Christoph Lameter pointed out that packet drops at qdisc level were not
> accounted in SNMP counters. Only if application sets IP_RECVERR, drops
> are reported to user and SNMP counters updated.
> 
> IP_RECVERR is used to enable extended reliable error message passing.
> In case of tx drops at qdisc level, no error packet will be generated.
> It seems unnecessary to hide the qdisc drops for non IP_RECVERR enabled
> sockets (as probably most sockets are)
> 
> By removing the check of IP_RECVERR enabled sockets in ip_push_pending_frames()/
> raw_send_hdrinc() / ip6_push_pending_frames() / rawv6_send_hdrinc(),
> we can properly update IPSTATS_MIB_OUTDISCARDS, and in case of UDP, update
> UDP_MIB_SNDBUFERRORS SNMP counters.
> 
> Application send() syscalls, instead of returning an OK status (thus lying),
> will return -ENOBUFS error.
> 
> Note : send() manual page explicitly says for -ENOBUFS error :
> 
>  "The output queue for a network interface was full.
>   This generally indicates that the interface has stopped sending,
>   but may be caused by transient congestion.
>   (Normally, this does not occur in Linux. Packets are just silently
>   dropped when a device queue overflows.) "
> 
> This was not true for IP_RECVERR enabled sockets for < 2.6.32 linuxes,
> and starting from Linux 2.6.32, the last part won't be true at all.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

The core question in all of this is what IP_RECVERR means.

As far as I remember Alexey Kuznetsov's intentions, it means that the
application is interested in learning about errors caused by the
infrastructure of the network between local and remote stacks.

Reporting a qdisc level drop to the application by default has the
potential to break applications, because BSD and other stacks do not
do this.

I can see why we might be able to get away with making this change
now.  And I also can see the benefits of it, for sure.

Let me think about this some more.


* Re: UDP multicast packet loss not reported if TX ring overrun?
  2009-08-28 19:24                                           ` David Miller
  2009-08-28 19:53                                             ` Christoph Lameter
@ 2009-08-30  0:21                                             ` Mark Smith
  1 sibling, 0 replies; 73+ messages in thread
From: Mark Smith @ 2009-08-30  0:21 UTC (permalink / raw)
  To: David Miller; +Cc: cl, sri, dlstevens, eric.dumazet, netdev, niv

On Fri, 28 Aug 2009 12:24:59 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Fri, 28 Aug 2009 09:53:40 -0400 (EDT)
> 
> > Seems though that the qdisc drop count does not flow into the tx_dropped
> > counter for the interface.
> 
> And it should not.
> 
> The qdisc drops the packet due to flow control, not the hardware
> device.
> 
> Device drops are for things like transmission errors on the wire.
> 
> If you start incrementing tx_dropped here, people won't be able
> to tell they have a deteriorating cable or bad switch or similar.

And it does, because Cisco does it this way, although they record them as
TX errors rather than drops. It's quite annoying to have to keep reminding
yourself that the errors shown on the traffic graphs probably aren't
actually errors; of course, real errors then become hidden among the
"normal" ones.



* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-08-29  6:38                                                 ` David Miller
@ 2009-08-31 12:09                                                   ` Eric Dumazet
  2009-09-02  1:41                                                     ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-08-31 12:09 UTC (permalink / raw)
  To: David Miller; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 28 Aug 2009 19:26:04 +0200
> 
>> [PATCH] ip: Report qdisc packet drops
>>
>> Christoph Lameter pointed out that packet drops at qdisc level were not
>> accounted in SNMP counters. Only if application sets IP_RECVERR, drops
>> are reported to user and SNMP counters updated.
>>
>> IP_RECVERR is used to enable extended reliable error message passing.
>> In case of tx drops at qdisc level, no error packet will be generated.
>> It seems unnecessary to hide the qdisc drops for non IP_RECVERR enabled
>> sockets (as probably most sockets are)
>>
>> By removing the check of IP_RECVERR enabled sockets in ip_push_pending_frames()/
>> raw_send_hdrinc() / ip6_push_pending_frames() / rawv6_send_hdrinc(),
>> we can properly update IPSTATS_MIB_OUTDISCARDS, and in case of UDP, update
>> UDP_MIB_SNDBUFERRORS SNMP counters.
>>
>> Application send() syscalls, instead of returning an OK status (thus lying),
>> will return -ENOBUFS error.
>>
>> Note : send() manual page explicitly says for -ENOBUFS error :
>>
>>  "The output queue for a network interface was full.
>>   This generally indicates that the interface has stopped sending,
>>   but may be caused by transient congestion.
>>   (Normally, this does not occur in Linux. Packets are just silently
>>   dropped when a device queue overflows.) "
>>
>> This was not true for IP_RECVERR enabled sockets for < 2.6.32 linuxes,
>> and starting from Linux 2.6.32, the last part won't be true at all.
>>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
> 
> The core question in all of this is what IP_RECVERR means.
> 
> As far as I remember Alexey Kuznetsov's intentions, it means that the
> application is interested in learning about errors caused by the
> infrastructure of the network between local and remote stacks.
> 
> Reporting a qdisc level drop to the application by default has the
> potential to break applications, because BSD and other stacks do not
> do this.
> 
> I can see why we might be able to get away with making this change
> now.  And I also can see the benefits of it, for sure.
> 
> Let me think about this some more.

Yes, I agree this is risky.

Re-reading this stuff again, I realized ip6_push_pending_frames()
was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.

May I suggest the following path:

1) Correct ip6_push_pending_frames() to properly
account for dropped-by-qdisc frames when IP_RECVERR is set

2) Submit a patch to account for qdisc-dropped frames in SNMP counters,
but still return OK to the user application, so as not to break it?

Thanks

[PATCH] ipv6: ip6_push_pending_frames() should increment IPSTATS_MIB_OUTDISCARDS

qdisc drops should be notified to IP_RECVERR enabled sockets, as done in IPV4.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6ad5aad..a931229 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1520,6 +1520,7 @@ out:
 	ip6_cork_release(inet, np);
 	return err;
 error:
+	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	goto out;
 }
 


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-08-31 12:09                                                   ` Eric Dumazet
@ 2009-09-02  1:41                                                     ` David Miller
  2009-09-02 14:43                                                       ` Eric Dumazet
  2009-09-02 18:22                                                       ` Christoph Lameter
  0 siblings, 2 replies; 73+ messages in thread
From: David Miller @ 2009-09-02  1:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 31 Aug 2009 14:09:50 +0200

> Re-reading this stuff again, I realized ip6_push_pending_frames()
> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
> 
> May I suggest the following path:
> 
> 1) Correct ip6_push_pending_frames() to properly
> account for dropped-by-qdisc frames when IP_RECVERR is set

Your patch is applied to net-next-2.6, thanks!

> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
> but still return OK to the user application, so as not to break it?

Sounds good.

I think if you sample random UDP applications, you will find that such
errors will upset them terribly, make them log tons of crap to
/var/log/messages et al., and consume tons of CPU.

And in such cases silent ignoring of drops is entirely appropriate and
optimal, which supports our current behavior.

If we are to make such applications "more sophisticated", such
converted apps can indicate this simply by their use of IP_RECVERR.

If you want to be notified of all asynchronous errors we can detect,
you use this, end of story.  It is the only way to handle this
situation without breaking the world.

As usual, Alexey Kuznetsov's analysis of this situation is timeless,
accurate, and wise.  And he understood all of this 10+ years ago.


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02  1:41                                                     ` David Miller
@ 2009-09-02 14:43                                                       ` Eric Dumazet
  2009-09-02 16:11                                                         ` Sridhar Samudrala
                                                                           ` (2 more replies)
  2009-09-02 18:22                                                       ` Christoph Lameter
  1 sibling, 3 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-09-02 14:43 UTC (permalink / raw)
  To: David Miller; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Mon, 31 Aug 2009 14:09:50 +0200
> 
>> Re-reading this stuff again, I realized ip6_push_pending_frames()
>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>
>> May I suggest the following path:
>>
>> 1) Correct ip6_push_pending_frames() to properly
>> account for dropped-by-qdisc frames when IP_RECVERR is set
> 
> Your patch is  applied to net-next-2.6, thanks!
> 
>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>> but still return OK to the user application, so as not to break it?
> 
> Sounds good.
> 
> I think if you sample random UDP applications, you will find that such
> errors will upset them terribly, make them log tons of crap to
> /var/log/messages et al., and consume tons of CPU.
> 
> And in such cases silent ignoring of drops is entirely appropriate and
> optimal, which supports our current behavior.
> 
> If we are to make such applications "more sophisticated" such
> converted apps can be indicated simply their use of IP_RECVERR.
> 
> If you want to be notified of all asynchronous errors we can detect,
> you use this, end of story.  It is the only way to handle this
> situation without breaking the world.
> 
> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> accurate, and wise.  And he understood all of this 10+ years ago.

Thanks David, here is the 2nd patch then :


[PATCH net-next-2.6] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at the qdisc level were not
accounted for in SNMP counters. Only if the application sets IP_RECVERR are
drops reported to the user (as -ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but it should not be required in order to update system-wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after a UDP tx flood:
# netstat -s 
...
IP:
    1487048 outgoing packets dropped
...
Udp:
...
    SndbufErrors: 1487048


send() syscalls do, however, still return an OK status, so as not to
break applications.

Note: the send() manual page explicitly says for the -ENOBUFS error:

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets: a send() syscall
that hits a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h      |    2 +-
 include/net/ipv6.h    |    2 +-
 include/net/udp.h     |    2 +-
 net/ipv4/icmp.c       |    2 +-
 net/ipv4/ip_output.c  |   19 ++++++++++---------
 net/ipv4/raw.c        |   14 ++++++++++----
 net/ipv4/udp.c        |   20 +++++++++++++-------
 net/ipv6/icmp.c       |    2 +-
 net/ipv6/ip6_output.c |   18 +++++++++++-------
 net/ipv6/raw.c        |   15 ++++++++++-----
 net/ipv6/udp.c        |   14 ++++++++++----
 11 files changed, 69 insertions(+), 41 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 72c3692..9dd19a8 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -116,7 +116,7 @@ extern int		ip_append_data(struct sock *sk,
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
 				int offset, size_t size, int flags);
-extern int		ip_push_pending_frames(struct sock *sk);
+extern int		ip_push_pending_frames(struct sock *sk, int recverr);
 extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ad9a511..f514257 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -498,7 +498,7 @@ extern int			ip6_append_data(struct sock *sk,
 						struct rt6_info *rt,
 						unsigned int flags);
 
-extern int			ip6_push_pending_frames(struct sock *sk);
+extern int			ip6_push_pending_frames(struct sock *sk, int recverr);
 
 extern void			ip6_flush_pending_frames(struct sock *sk);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 5fb029f..a60ef10 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -145,7 +145,7 @@ extern int 	udp_lib_getsockopt(struct sock *sk, int level, int optname,
 			           char __user *optval, int __user *optlen);
 extern int 	udp_lib_setsockopt(struct sock *sk, int level, int optname,
 				   char __user *optval, int optlen,
-				   int (*push_pending_frames)(struct sock *));
+				   int (*push_pending_frames)(struct sock *, int));
 
 extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
 				    __be32 daddr, __be16 dport,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 97c410e..f46a53c 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -345,7 +345,7 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 						 icmp_param->head_len, csum);
 		icmph->checksum = csum_fold(csum);
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..8f81dab 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1216,7 +1216,7 @@ static void ip_cork_release(struct inet_sock *inet)
  *	Combined all pending IP fragments on the socket as one IP datagram
  *	and push them out.
  */
-int ip_push_pending_frames(struct sock *sk)
+int ip_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1301,19 +1301,20 @@ int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*
@@ -1412,7 +1413,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
 								arg->csum));
 		skb->ip_summed = CHECKSUM_NONE;
-		ip_push_pending_frames(sk);
+		ip_push_pending_frames(sk, 0);
 	}
 
 	bh_unlock_sock(sk);
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..444c465 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -374,8 +374,13 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!inet->recverr && err) {
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -576,8 +581,9 @@ back_from_confirm:
 					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
-		else if (!(msg->msg_flags & MSG_MORE))
-			err = ip_push_pending_frames(sk);
+		else if (!(msg->msg_flags & MSG_MORE)) {
+			err = ip_push_pending_frames(sk, inet->recverr);
+		}
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..6a6bf1d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -513,7 +513,7 @@ static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
 /*
  * Push out all pending data as one UDP datagram. Socket is locked.
  */
-static int udp_push_pending_frames(struct sock *sk)
+static int udp_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct udp_sock  *up = udp_sk(sk);
 	struct inet_sock *inet = inet_sk(sk);
@@ -560,7 +560,7 @@ static int udp_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip_push_pending_frames(sk);
+	err = ip_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -752,8 +752,14 @@ do_append_data:
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
 	release_sock(sk);
@@ -826,7 +832,7 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 
 	up->len += size;
 	if (!(up->corkflag || (flags&MSG_MORE)))
-		ret = udp_push_pending_frames(sk);
+		ret = udp_push_pending_frames(sk, inet_sk(sk)->recverr);
 	if (!ret)
 		ret = size;
 out:
@@ -1354,7 +1360,7 @@ void udp_destroy_sock(struct sock *sk)
  */
 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		       char __user *optval, int optlen,
-		       int (*push_pending_frames)(struct sock *))
+		       int (*push_pending_frames)(struct sock *, int))
 {
 	struct udp_sock *up = udp_sk(sk);
 	int val;
@@ -1374,7 +1380,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
 		} else {
 			up->corkflag = 0;
 			lock_sock(sk);
-			(*push_pending_frames)(sk);
+			(*push_pending_frames)(sk, 0);
 			release_sock(sk);
 		}
 		break;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index e2325f6..a9c54c2 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -253,7 +253,7 @@ static int icmpv6_push_pending_frames(struct sock *sk, struct flowi *fl, struct
 						      len, fl->proto,
 						      tmp_csum);
 	}
-	ip6_push_pending_frames(sk);
+	ip6_push_pending_frames(sk, 0);
 out:
 	return err;
 }
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a931229..ade5707 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1440,7 +1440,7 @@ static void ip6_cork_release(struct inet_sock *inet, struct ipv6_pinfo *np)
 	memset(&inet->cork.fl, 0, sizeof(inet->cork.fl));
 }
 
-int ip6_push_pending_frames(struct sock *sk)
+int ip6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb, *tmp_skb;
 	struct sk_buff **tail_skb;
@@ -1510,18 +1510,22 @@ int ip6_push_pending_frames(struct sock *sk)
 
 	err = ip6_local_out(skb);
 	if (err) {
-		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+		if (err > 0) {
+			err = net_xmit_errno(err);
+			if (err && !recverr) {
+				IP6_INC_STATS(net, rt->rt6i_idev,
+					      IPSTATS_MIB_OUTDISCARDS);
+				err = 0;
+			}
+		}
 		if (err)
-			goto error;
+			IP6_INC_STATS(net, rt->rt6i_idev,
+				      IPSTATS_MIB_OUTDISCARDS);
 	}
 
 out:
 	ip6_cork_release(inet, np);
 	return err;
-error:
-	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 void ip6_flush_pending_frames(struct sock *sk)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..d054fa2 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -523,7 +523,7 @@ csum_copy_err:
 }
 
 static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
-				     struct raw6_sock *rp)
+				     struct raw6_sock *rp, int recverr)
 {
 	struct sk_buff *skb;
 	int err = 0;
@@ -595,7 +595,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
 		BUG();
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	return err;
 }
@@ -641,8 +641,13 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
-	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+	if (err > 0) {
+		err = net_xmit_errno(err);
+		if (!np->recverr && err) {
+			IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+			err = 0;
+		}
+	}
 	if (err)
 		goto error;
 out:
@@ -895,7 +900,7 @@ back_from_confirm:
 		if (err)
 			ip6_flush_pending_frames(sk);
 		else if (!(msg->msg_flags & MSG_MORE))
-			err = rawv6_push_pending_frames(sk, &fl, rp);
+			err = rawv6_push_pending_frames(sk, &fl, rp, np->recverr);
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 20d2ffc..963dd0a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -683,7 +683,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
  *	Sending
  */
 
-static int udp_v6_push_pending_frames(struct sock *sk)
+static int udp_v6_push_pending_frames(struct sock *sk, int recverr)
 {
 	struct sk_buff *skb;
 	struct udphdr *uh;
@@ -723,7 +723,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 		uh->check = CSUM_MANGLED_0;
 
 send:
-	err = ip6_push_pending_frames(sk);
+	err = ip6_push_pending_frames(sk, recverr);
 out:
 	up->len = 0;
 	up->pending = 0;
@@ -975,8 +975,14 @@ do_append_data:
 		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_v6_flush_pending_frames(sk);
-	else if (!corkreq)
-		err = udp_v6_push_pending_frames(sk);
+	else if (!corkreq) {
+		err = udp_v6_push_pending_frames(sk, 1);
+		if (err == -ENOBUFS && !np->recverr) {
+			UDP6_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	}
 	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
 		up->pending = 0;
 

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 19:37                                                         ` Christoph Lameter
@ 2009-09-02 16:05                                                           ` Eric Dumazet
  0 siblings, 0 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-09-02 16:05 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages

Christoph Lameter wrote:
> The patch is smaller if you remove the handling of recverr completely from
> ip_push_pending_frames() and return NET_RX_DROP etc. Two of the callers
> never even inspect the return code. For them this is useless processing.

We must check NET_XMIT_CN before doing the update, but you are probably right.

I'll cook another patch ASAP, thanks !




* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 14:43                                                       ` Eric Dumazet
@ 2009-09-02 16:11                                                         ` Sridhar Samudrala
  2009-09-02 16:20                                                           ` Eric Dumazet
  2009-09-02 19:37                                                         ` Christoph Lameter
  2009-09-02 22:26                                                         ` Eric Dumazet
  2 siblings, 1 reply; 73+ messages in thread
From: Sridhar Samudrala @ 2009-09-02 16:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, cl, dlstevens, netdev, niv, mtk.manpages

On Wed, 2009-09-02 at 16:43 +0200, Eric Dumazet wrote:
> David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 31 Aug 2009 14:09:50 +0200
> > 
> >> Re-reading this stuff again, I realized ip6_push_pending_frames()
> >> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
> >>
> >> May I suggest the following path:
> >>
> >> 1) Correct ip6_push_pending_frames() to properly
> >> account for dropped-by-qdisc frames when IP_RECVERR is set
> > 
> > Your patch is  applied to net-next-2.6, thanks!
> > 
> >> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
> >> but still return an OK to the user application, so as not to break them?
> > 
> > Sounds good.
> > 
> > I think if you sample random UDP applications, you will find that such
> > errors will upset them terribly, make them log tons of crap to
> > /var/log/messages et al., and consume tons of CPU.
> > 
> > And in such cases silent ignoring of drops is entirely appropriate and
> > optimal, which supports our current behavior.
> > 
> > If we are to make such applications "more sophisticated" such
> > converted apps can indicate that simply by their use of IP_RECVERR.
> > 
> > If you want to be notified of all asynchronous errors we can detect,
> > you use this, end of story.  It is the only way to handle this
> > situation without breaking the world.
> > 
> > As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> > accurate, and wise.  And he understood all of this 10+ years ago.
> 
> Thanks David, here is the 2nd patch then :
> 
> 
> [PATCH net-next-2.6] ip: Report qdisc packet drops
> 
> Christoph Lameter pointed out that packet drops at the qdisc level were not
> accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
> reported to the user (-ENOBUFS errors) and SNMP counters updated.
> 
> IP_RECVERR is used to enable extended reliable error message passing,
> but that is not needed to update system-wide SNMP stats.
> 
> This patch changes things a bit to allow SNMP counters to be updated
> regardless of whether IP_RECVERR is set on the socket.
> 
> Example after a UDP tx flood
> # netstat -s 
> ...
> IP:
>     1487048 outgoing packets dropped
> ...
> Udp:
> ...
>     SndbufErrors: 1487048
> 

Didn't we agree that qdisc drops should not be counted as IP or UDP
drops, as David Stevens pointed out?
I would say that even when IP_RECVERR is set, the SNMP counters at the IP
and UDP levels should not be incremented when a packet is dropped at the
qdisc level, though the error can still be reported to the user.

Now that the qdisc stats issue is figured out and drops can be accounted
and seen at the qdisc level, isn't it confusing if we count the same drop
at the IP, UDP and qdisc levels?

Thanks
Sridhar

> 
> send() syscalls do, however, still return an OK status, so as not to
> break applications.
> 
> Note: the send() manual page explicitly says for the -ENOBUFS error:
> 
>  "The output queue for a network interface was full.
>   This generally indicates that the interface has stopped sending,
>   but may be caused by transient congestion.
>   (Normally, this does not occur in Linux. Packets are just silently
>   dropped when a device queue overflows.) "
> 
> This is not true for IP_RECVERR-enabled sockets: a send() syscall
> that hits a qdisc drop returns an ENOBUFS error.
> 
> Many thanks to Christoph, David, and last but not least, Alexey !
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/ip.h      |    2 +-
>  include/net/ipv6.h    |    2 +-
>  include/net/udp.h     |    2 +-
>  net/ipv4/icmp.c       |    2 +-
>  net/ipv4/ip_output.c  |   19 ++++++++++---------
>  net/ipv4/raw.c        |   14 ++++++++++----
>  net/ipv4/udp.c        |   20 +++++++++++++-------
>  net/ipv6/icmp.c       |    2 +-
>  net/ipv6/ip6_output.c |   18 +++++++++++-------
>  net/ipv6/raw.c        |   15 ++++++++++-----
>  net/ipv6/udp.c        |   14 ++++++++++----
>  11 files changed, 69 insertions(+), 41 deletions(-)
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 72c3692..9dd19a8 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -116,7 +116,7 @@ extern int		ip_append_data(struct sock *sk,
>  extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
>  extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
>  				int offset, size_t size, int flags);
> -extern int		ip_push_pending_frames(struct sock *sk);
> +extern int		ip_push_pending_frames(struct sock *sk, int recverr);
>  extern void		ip_flush_pending_frames(struct sock *sk);
> 
>  /* datagram.c */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index ad9a511..f514257 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -498,7 +498,7 @@ extern int			ip6_append_data(struct sock *sk,
>  						struct rt6_info *rt,
>  						unsigned int flags);
> 
> -extern int			ip6_push_pending_frames(struct sock *sk);
> +extern int			ip6_push_pending_frames(struct sock *sk, int recverr);
> 
>  extern void			ip6_flush_pending_frames(struct sock *sk);
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 5fb029f..a60ef10 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -145,7 +145,7 @@ extern int 	udp_lib_getsockopt(struct sock *sk, int level, int optname,
>  			           char __user *optval, int __user *optlen);
>  extern int 	udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  				   char __user *optval, int optlen,
> -				   int (*push_pending_frames)(struct sock *));
> +				   int (*push_pending_frames)(struct sock *, int));
> 
>  extern struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport,
>  				    __be32 daddr, __be16 dport,
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 97c410e..f46a53c 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -345,7 +345,7 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
>  						 icmp_param->head_len, csum);
>  		icmph->checksum = csum_fold(csum);
>  		skb->ip_summed = CHECKSUM_NONE;
> -		ip_push_pending_frames(sk);
> +		ip_push_pending_frames(sk, 0);
>  	}
>  }
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7d08210..8f81dab 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1216,7 +1216,7 @@ static void ip_cork_release(struct inet_sock *inet)
>   *	Combined all pending IP fragments on the socket as one IP datagram
>   *	and push them out.
>   */
> -int ip_push_pending_frames(struct sock *sk)
> +int ip_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb, *tmp_skb;
>  	struct sk_buff **tail_skb;
> @@ -1301,19 +1301,20 @@ int ip_push_pending_frames(struct sock *sk)
>  	/* Netfilter gets whole the not fragmented skb. */
>  	err = ip_local_out(skb);
>  	if (err) {
> -		if (err > 0)
> -			err = inet->recverr ? net_xmit_errno(err) : 0;
> +		if (err > 0) {
> +			err = net_xmit_errno(err);
> +			if (err && !recverr) {
> +				IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +				err = 0;
> +			}
> +		}
>  		if (err)
> -			goto error;
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
>  	}
> 
>  out:
>  	ip_cork_release(inet);
>  	return err;
> -
> -error:
> -	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  /*
> @@ -1412,7 +1413,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
>  			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
>  								arg->csum));
>  		skb->ip_summed = CHECKSUM_NONE;
> -		ip_push_pending_frames(sk);
> +		ip_push_pending_frames(sk, 0);
>  	}
> 
>  	bh_unlock_sock(sk);
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 2979f14..444c465 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -374,8 +374,13 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
> 
>  	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
>  		      dst_output);
> -	if (err > 0)
> -		err = inet->recverr ? net_xmit_errno(err) : 0;
> +	if (err > 0) {
> +		err = net_xmit_errno(err);
> +		if (!inet->recverr && err) {
> +			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
> +			err = 0;
> +		}
> +	}
>  	if (err)
>  		goto error;
>  out:
> @@ -576,8 +581,9 @@ back_from_confirm:
>  					&ipc, &rt, msg->msg_flags);
>  		if (err)
>  			ip_flush_pending_frames(sk);
> -		else if (!(msg->msg_flags & MSG_MORE))
> -			err = ip_push_pending_frames(sk);
> +		else if (!(msg->msg_flags & MSG_MORE)) {
> +			err = ip_push_pending_frames(sk, inet->recverr);
> +		}
>  		release_sock(sk);
>  	}
>  done:
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 29ebb0d..6a6bf1d 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -513,7 +513,7 @@ static void udp4_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
>  /*
>   * Push out all pending data as one UDP datagram. Socket is locked.
>   */
> -static int udp_push_pending_frames(struct sock *sk)
> +static int udp_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct udp_sock  *up = udp_sk(sk);
>  	struct inet_sock *inet = inet_sk(sk);
> @@ -560,7 +560,7 @@ static int udp_push_pending_frames(struct sock *sk)
>  		uh->check = CSUM_MANGLED_0;
> 
>  send:
> -	err = ip_push_pending_frames(sk);
> +	err = ip_push_pending_frames(sk, recverr);
>  out:
>  	up->len = 0;
>  	up->pending = 0;
> @@ -752,8 +752,14 @@ do_append_data:
>  			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
>  	if (err)
>  		udp_flush_pending_frames(sk);
> -	else if (!corkreq)
> -		err = udp_push_pending_frames(sk);
> +	else if (!corkreq) {
> +		err = udp_push_pending_frames(sk, 1);
> +		if (err == -ENOBUFS && !inet->recverr) {
> +			UDP_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> +			err = 0;
> +		}
> +	}
>  	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
>  		up->pending = 0;
>  	release_sock(sk);
> @@ -826,7 +832,7 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
> 
>  	up->len += size;
>  	if (!(up->corkflag || (flags&MSG_MORE)))
> -		ret = udp_push_pending_frames(sk);
> +		ret = udp_push_pending_frames(sk, inet_sk(sk)->recverr);
>  	if (!ret)
>  		ret = size;
>  out:
> @@ -1354,7 +1360,7 @@ void udp_destroy_sock(struct sock *sk)
>   */
>  int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  		       char __user *optval, int optlen,
> -		       int (*push_pending_frames)(struct sock *))
> +		       int (*push_pending_frames)(struct sock *, int))
>  {
>  	struct udp_sock *up = udp_sk(sk);
>  	int val;
> @@ -1374,7 +1380,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
>  		} else {
>  			up->corkflag = 0;
>  			lock_sock(sk);
> -			(*push_pending_frames)(sk);
> +			(*push_pending_frames)(sk, 0);
>  			release_sock(sk);
>  		}
>  		break;
> diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
> index e2325f6..a9c54c2 100644
> --- a/net/ipv6/icmp.c
> +++ b/net/ipv6/icmp.c
> @@ -253,7 +253,7 @@ static int icmpv6_push_pending_frames(struct sock *sk, struct flowi *fl, struct
>  						      len, fl->proto,
>  						      tmp_csum);
>  	}
> -	ip6_push_pending_frames(sk);
> +	ip6_push_pending_frames(sk, 0);
>  out:
>  	return err;
>  }
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index a931229..ade5707 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1440,7 +1440,7 @@ static void ip6_cork_release(struct inet_sock *inet, struct ipv6_pinfo *np)
>  	memset(&inet->cork.fl, 0, sizeof(inet->cork.fl));
>  }
> 
> -int ip6_push_pending_frames(struct sock *sk)
> +int ip6_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb, *tmp_skb;
>  	struct sk_buff **tail_skb;
> @@ -1510,18 +1510,22 @@ int ip6_push_pending_frames(struct sock *sk)
> 
>  	err = ip6_local_out(skb);
>  	if (err) {
> -		if (err > 0)
> -			err = np->recverr ? net_xmit_errno(err) : 0;
> +		if (err > 0) {
> +			err = net_xmit_errno(err);
> +			if (err && !recverr) {
> +				IP6_INC_STATS(net, rt->rt6i_idev,
> +					      IPSTATS_MIB_OUTDISCARDS);
> +				err = 0;
> +			}
> +		}
>  		if (err)
> -			goto error;
> +			IP6_INC_STATS(net, rt->rt6i_idev,
> +				      IPSTATS_MIB_OUTDISCARDS);
>  	}
> 
>  out:
>  	ip6_cork_release(inet, np);
>  	return err;
> -error:
> -	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
> -	goto out;
>  }
> 
>  void ip6_flush_pending_frames(struct sock *sk)
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 5068410..d054fa2 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
> @@ -523,7 +523,7 @@ csum_copy_err:
>  }
> 
>  static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
> -				     struct raw6_sock *rp)
> +				     struct raw6_sock *rp, int recverr)
>  {
>  	struct sk_buff *skb;
>  	int err = 0;
> @@ -595,7 +595,7 @@ static int rawv6_push_pending_frames(struct sock *sk, struct flowi *fl,
>  		BUG();
> 
>  send:
> -	err = ip6_push_pending_frames(sk);
> +	err = ip6_push_pending_frames(sk, recverr);
>  out:
>  	return err;
>  }
> @@ -641,8 +641,13 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
>  	IP6_UPD_PO_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUT, skb->len);
>  	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
>  		      dst_output);
> -	if (err > 0)
> -		err = np->recverr ? net_xmit_errno(err) : 0;
> +	if (err > 0) {
> +		err = net_xmit_errno(err);
> +		if (!np->recverr && err) {
> +			IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
> +			err = 0;
> +		}
> +	}
>  	if (err)
>  		goto error;
>  out:
> @@ -895,7 +900,7 @@ back_from_confirm:
>  		if (err)
>  			ip6_flush_pending_frames(sk);
>  		else if (!(msg->msg_flags & MSG_MORE))
> -			err = rawv6_push_pending_frames(sk, &fl, rp);
> +			err = rawv6_push_pending_frames(sk, &fl, rp, np->recverr);
>  		release_sock(sk);
>  	}
>  done:
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 20d2ffc..963dd0a 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -683,7 +683,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
>   *	Sending
>   */
> 
> -static int udp_v6_push_pending_frames(struct sock *sk)
> +static int udp_v6_push_pending_frames(struct sock *sk, int recverr)
>  {
>  	struct sk_buff *skb;
>  	struct udphdr *uh;
> @@ -723,7 +723,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
>  		uh->check = CSUM_MANGLED_0;
> 
>  send:
> -	err = ip6_push_pending_frames(sk);
> +	err = ip6_push_pending_frames(sk, recverr);
>  out:
>  	up->len = 0;
>  	up->pending = 0;
> @@ -975,8 +975,14 @@ do_append_data:
>  		corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
>  	if (err)
>  		udp_v6_flush_pending_frames(sk);
> -	else if (!corkreq)
> -		err = udp_v6_push_pending_frames(sk);
> +	else if (!corkreq) {
> +		err = udp_v6_push_pending_frames(sk, 1);
> +		if (err == -ENOBUFS && !np->recverr) {
> +			UDP6_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> +			err = 0;
> +		}
> +	}
>  	else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
>  		up->pending = 0;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 16:11                                                         ` Sridhar Samudrala
@ 2009-09-02 16:20                                                           ` Eric Dumazet
  0 siblings, 0 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-09-02 16:20 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: David Miller, cl, dlstevens, netdev, niv, mtk.manpages

Sridhar Samudrala wrote:
> On Wed, 2009-09-02 at 16:43 +0200, Eric Dumazet wrote:
>> David Miller wrote:
>>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>> Date: Mon, 31 Aug 2009 14:09:50 +0200
>>>
>>>> Re-reading this stuff again, I realized ip6_push_pending_frames()
>>>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>>>
>>>> May I suggest the following path:
>>>>
>>>> 1) Correct ip6_push_pending_frames() to properly
>>>> account for dropped-by-qdisc frames when IP_RECVERR is set
>>> Your patch is  applied to net-next-2.6, thanks!
>>>
>>>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>>>> but still return an OK to the user application, so as not to break them?
>>> Sounds good.
>>>
>>> I think if you sample random UDP applications, you will find that such
>>> errors will upset them terribly, make them log tons of crap to
>>> /var/log/messages et al., and consume tons of CPU.
>>>
>>> And in such cases silent ignoring of drops is entirely appropriate and
>>> optimal, which supports our current behavior.
>>>
>>> If we are to make such applications "more sophisticated" such
>>> converted apps can indicate that simply by their use of IP_RECVERR.
>>>
>>> If you want to be notified of all asynchronous errors we can detect,
>>> you use this, end of story.  It is the only way to handle this
>>> situation without breaking the world.
>>>
>>> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
>>> accurate, and wise.  And he understood all of this 10+ years ago.
>> Thanks David, here is the 2nd patch then :
>>
>>
>> [PATCH net-next-2.6] ip: Report qdisc packet drops
>>
>> Christoph Lameter pointed out that packet drops at the qdisc level were not
>> accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
>> reported to the user (-ENOBUFS errors) and SNMP counters updated.
>>
>> IP_RECVERR is used to enable extended reliable error message passing,
>> but that is not needed to update system-wide SNMP stats.
>>
>> This patch changes things a bit to allow SNMP counters to be updated
>> regardless of whether IP_RECVERR is set on the socket.
>>
>> Example after a UDP tx flood
>> # netstat -s 
>> ...
>> IP:
>>     1487048 outgoing packets dropped
>> ...
>> Udp:
>> ...
>>     SndbufErrors: 1487048
>>
> 
> Didn't we agree that qdisc drops should not be counted as IP or UDP 
> drops as David Stevens pointed out?
> I would say that even when IP_RECVERR is set, SNMP counters at IP and
> UDP should not be counted when a packet is dropped at qdisc level,
> but the error can be reported to user.
> 
> Now that qdisc stats issue is figured out and they can be accounted
> and seen at qdisc level, doesn't it confuse if we count the same drop 
> at IP, UDP and qdisc level?
> 
> Thanks
> Sridhar
>

Yes, I am aware of David's point, but it's already not true with current kernels.

With current kernels, for a UDP frame sent by an application:

if IP_RECVERR is not set, no SNMP error is logged at either the IP or UDP level

if IP_RECVERR is set, qdisc drops are reported to both the IP and UDP
SNMP counters.



udp_sendmsg()
{
...
out:
        ip_rt_put(rt);
        if (free)
                kfree(ipc.opt);
        if (!err)
                return len;
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things).  We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;
...
}


So what shall we do?

IMHO, one should not bump MIB counters in different domains (IP / UDP) for
the same drop; this makes no sense.



* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02  1:41                                                     ` David Miller
  2009-09-02 14:43                                                       ` Eric Dumazet
@ 2009-09-02 18:22                                                       ` Christoph Lameter
  2009-09-03  1:09                                                         ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-09-02 18:22 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, sri, dlstevens, netdev, niv, mtk.manpages

On Tue, 1 Sep 2009, David Miller wrote:

> > 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
> > but still return an OK to the user application, so as not to break them?
>
> Sounds good.

Great. That was my initial suggestion and it would ensure that no apps
break.

> If we are to make such applications "more sophisticated" such
> converted apps can indicate that simply by their use of IP_RECVERR.

There may be a minor issue here in that IP_RECVERR sometimes sends error
packets that have to be intercepted using special code. Or can those be
simply ignored? If so, then I will ask UDP app vendors to use IP_RECVERR.

> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
> accurate, and wise.  And he understood all of this 10+ years ago.

His code was just slightly buggy .... ;-)


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 14:43                                                       ` Eric Dumazet
  2009-09-02 16:11                                                         ` Sridhar Samudrala
@ 2009-09-02 19:37                                                         ` Christoph Lameter
  2009-09-02 16:05                                                           ` Eric Dumazet
  2009-09-02 22:26                                                         ` Eric Dumazet
  2 siblings, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-09-02 19:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages


The patch is smaller if you remove the handling of recverr completely from
ip_push_pending_frames() and return NET_RX_DROP etc. Two of the callers
never even inspect the return code. For them this is useless processing.

The others could handle the processing of recverr on their own. Doing so
avoids adding code to ip_push_pending_frames(), which is latency critical,
and also avoids changing the calling conventions.

I have a draft here from our earlier discussions but it's not as
comprehensive as yours.




* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 14:43                                                       ` Eric Dumazet
  2009-09-02 16:11                                                         ` Sridhar Samudrala
  2009-09-02 19:37                                                         ` Christoph Lameter
@ 2009-09-02 22:26                                                         ` Eric Dumazet
  2009-09-03  1:05                                                           ` David Miller
  2009-09-03 17:57                                                           ` Christoph Lameter
  2 siblings, 2 replies; 73+ messages in thread
From: Eric Dumazet @ 2009-09-02 22:26 UTC (permalink / raw)
  To: David Miller; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

Eric Dumazet wrote:
> David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Mon, 31 Aug 2009 14:09:50 +0200
>>
>>> Re-reading this stuff again, I realized ip6_push_pending_frames()
>>> was not updating IPSTATS_MIB_OUTDISCARDS, even if IP_RECVERR was set.
>>>
>>> May I suggest the following path:
>>>
>>> 1) Correct ip6_push_pending_frames() to properly
>>> account for dropped-by-qdisc frames when IP_RECVERR is set
>> Your patch is  applied to net-next-2.6, thanks!
>>
>>> 2) Submit a patch to account for qdisc-dropped frames in SNMP counters
>>> but still return an OK to the user application, so as not to break them?
>> Sounds good.
>>
>> I think if you sample random UDP applications, you will find that such
>> errors will upset them terribly, make them log tons of crap to
>> /var/log/messages et al., and consume tons of CPU.
>>
>> And in such cases silent ignoring of drops is entirely appropriate and
>> optimal, which supports our current behavior.
>>
>> If we are to make such applications "more sophisticated" such
>> converted apps can indicate that simply by their use of IP_RECVERR.
>>
>> If you want to be notified of all asynchronous errors we can detect,
>> you use this, end of story.  It is the only way to handle this
>> situation without breaking the world.
>>
>> As usual, Alexey Kuznetsov's analysis of this situation is timeless,
>> accurate, and wise.  And he understood all of this 10+ years ago.
> 
> Thanks David, here is the 2nd patch then :

Here is an updated version of the patch, after Christoph's comments.



[PATCH net-next-2.6] ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at the qdisc level were not
accounted in SNMP counters. Only if the application sets IP_RECVERR are drops
reported to the user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but that is not needed to update system-wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated
regardless of whether IP_RECVERR is set on the socket.

Example after a UDP tx flood
# netstat -s 
...
IP:
    1487048 outgoing packets dropped
...
Udp:
...
    SndbufErrors: 1487048


send() syscalls do, however, still return an OK status, so as not to
break applications.

Note: the send() manual page explicitly says for the -ENOBUFS error:

 "The output queue for a network interface was full.
  This generally indicates that the interface has stopped sending,
  but may be caused by transient congestion.
  (Normally, this does not occur in Linux. Packets are just silently
  dropped when a device queue overflows.) "

This is not true for IP_RECVERR-enabled sockets: a send() syscall
that hits a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c  |    2 +-
 net/ipv4/raw.c        |    9 +++++++--
 net/ipv4/udp.c        |   12 +++++++++---
 net/ipv6/ip6_output.c |    2 +-
 net/ipv6/raw.c        |    4 +++-
 net/ipv6/udp.c        |   12 +++++++++---
 6 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2979f14..ebb1e58 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -375,7 +375,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	err = NF_HOOK(PF_INET, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = inet->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:
@@ -386,6 +386,8 @@ error_fault:
 	kfree_skb(skb);
 error:
 	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	if (err == -ENOBUFS && !inet->recverr)
+		err = 0;
 	return err;
 }
 
@@ -576,8 +578,11 @@ back_from_confirm:
 					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
-		else if (!(msg->msg_flags & MSG_MORE))
+		else if (!(msg->msg_flags & MSG_MORE)) {
 			err = ip_push_pending_frames(sk);
+			if (err == -ENOBUFS && !inet->recverr)
+				err = 0;
+		}
 		release_sock(sk);
 	}
 done:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 29ebb0d..ebaaa7f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
 
 send:
 	err = ip_push_pending_frames(sk);
+	if (err) {
+		if (err == -ENOBUFS && !inet->recverr) {
+			UDP_INC_STATS_USER(sock_net(sk),
+					   UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP_INC_STATS_USER(sock_net(sk),
+				   UDP_MIB_OUTDATAGRAMS, is_udplite);
 out:
 	up->len = 0;
 	up->pending = 0;
-	if (!err)
-		UDP_INC_STATS_USER(sock_net(sk),
-				UDP_MIB_OUTDATAGRAMS, is_udplite);
 	return err;
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a931229..cd48801 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1511,7 +1511,7 @@ int ip6_push_pending_frames(struct sock *sk)
 	err = ip6_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 5068410..7d675b8 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -642,7 +642,7 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	err = NF_HOOK(PF_INET6, NF_INET_LOCAL_OUT, skb, NULL, rt->u.dst.dev,
 		      dst_output);
 	if (err > 0)
-		err = np->recverr ? net_xmit_errno(err) : 0;
+		err = net_xmit_errno(err);
 	if (err)
 		goto error;
 out:
@@ -653,6 +653,8 @@ error_fault:
 	kfree_skb(skb);
 error:
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
+	if (err == -ENOBUFS && !np->recverr)
+		err = 0;
 	return err;
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 20d2ffc..1640406 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -724,12 +724,18 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 
 send:
 	err = ip6_push_pending_frames(sk);
+	if (err) {
+		if (err == -ENOBUFS && !inet6_sk(sk)->recverr) {
+			UDP6_INC_STATS_USER(sock_net(sk),
+					    UDP_MIB_SNDBUFERRORS, is_udplite);
+			err = 0;
+		}
+	} else
+		UDP6_INC_STATS_USER(sock_net(sk),
+				    UDP_MIB_OUTDATAGRAMS, is_udplite);
 out:
 	up->len = 0;
 	up->pending = 0;
-	if (!err)
-		UDP6_INC_STATS_USER(sock_net(sk),
-				UDP_MIB_OUTDATAGRAMS, is_udplite);
 	return err;
 }
 

^ permalink raw reply related	[flat|nested] 73+ messages in thread
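[Editorial note] With the patch above applied, a qdisc drop on a socket without IP_RECVERR no longer surfaces as an errno, but it now bumps the UDP SndbufErrors MIB counter, so the loss becomes visible in /proc/net/snmp. A rough sketch of reading that counter from userspace follows; parse_udp_snmp and the sample lines are invented for this example, not part of the thread:

```python
def parse_udp_snmp(snmp_text):
    """Parse the two 'Udp:' lines of a /proc/net/snmp dump into a dict.

    The first 'Udp:' line carries the counter names, the second the
    values. (Hypothetical helper written for this example.)
    """
    udp_lines = [l for l in snmp_text.splitlines() if l.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp section found")
    names = udp_lines[0].split()[1:]
    values = [int(v) for v in udp_lines[1].split()[1:]]
    return dict(zip(names, values))

# Sample shaped like /proc/net/snmp; the numbers are invented for the demo.
sample = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 174716487 0 0 3415693 0 585268\n"
)
stats = parse_udp_snmp(sample)
print(stats["SndbufErrors"])  # -> 585268
```

On a live system one would pass the contents of /proc/net/snmp instead of the sample string and watch SndbufErrors climb while the TX ring is being overrun.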

* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 22:26                                                         ` Eric Dumazet
@ 2009-09-03  1:05                                                           ` David Miller
  2009-09-03 17:57                                                           ` Christoph Lameter
  1 sibling, 0 replies; 73+ messages in thread
From: David Miller @ 2009-09-03  1:05 UTC (permalink / raw)
  To: eric.dumazet; +Cc: cl, sri, dlstevens, netdev, niv, mtk.manpages

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 03 Sep 2009 00:26:27 +0200

> Here is an updated version of the patch, after Christoph's comments.
> 
> 
> 
> [PATCH net-next-2.6] ip: Report qdisc packet drops

Looks great, applied!


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 18:22                                                       ` Christoph Lameter
@ 2009-09-03  1:09                                                         ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2009-09-03  1:09 UTC (permalink / raw)
  To: cl; +Cc: eric.dumazet, sri, dlstevens, netdev, niv, mtk.manpages

From: Christoph Lameter <cl@linux-foundation.org>
Date: Wed, 2 Sep 2009 13:22:25 -0500 (CDT)

> There may be a minor issue here in that IP_RECVERR sometimes sends error
> packets that have to be intercepted using special code. Or can those be
> simply ignored? If so then I will ask UDP app vendors to use IP_RECVERR.

If you don't set MSG_ERRQUEUE, no special error reports will be
given to the application.

And this only matters for recvmsg() handling.

On send, the only behavior difference is this error code reporting
(and before Eric's patch, SNMP statistics handling).
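[Editorial note] As noted above, IP_RECVERR only changes what send-side calls return and what can later be read back with MSG_ERRQUEUE; an application that never passes MSG_ERRQUEUE to recvmsg() sees no new error reports. A minimal Linux-only sketch of draining the error queue from userspace (drain_errqueue is a hypothetical helper; the numeric fallbacks are the Linux header values, used in case the Python build does not export the constants):

```python
import errno
import socket

# Linux-specific constants; fall back to the Linux numeric values when the
# socket module does not expose them. This is an illustrative sketch, not a
# portable program.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)

def drain_errqueue(sock):
    """Return all (data, ancdata) pairs currently queued on the socket
    error queue, without blocking."""
    reports = []
    while True:
        try:
            data, ancdata, flags, addr = sock.recvmsg(
                512, 512, MSG_ERRQUEUE | socket.MSG_DONTWAIT)
        except OSError as e:
            if e.errno == errno.EAGAIN:  # error queue is empty
                return reports
            raise
        reports.append((data, ancdata))

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)
pending = drain_errqueue(s)
print(len(pending))  # nothing has been sent yet, so the queue is empty
s.close()
```

After a send that fails asynchronously (e.g. an ICMP error coming back), the same drain call would return entries whose ancillary data carries the sock_extended_err structure described in ip(7).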


* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-03 17:57                                                           ` Christoph Lameter
@ 2009-09-03 14:12                                                             ` Eric Dumazet
  2009-09-03 18:35                                                               ` Christoph Lameter
  0 siblings, 1 reply; 73+ messages in thread
From: Eric Dumazet @ 2009-09-03 14:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages

Christoph Lameter wrote:
> On Thu, 3 Sep 2009, Eric Dumazet wrote:
>> index 29ebb0d..ebaaa7f 100644
>> --- a/net/ipv4/udp.c
>> +++ b/net/ipv4/udp.c
>> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
>>
>>  send:
>>  	err = ip_push_pending_frames(sk);
>> +	if (err) {
>> +		if (err == -ENOBUFS && !inet->recverr) {
>> +			UDP_INC_STATS_USER(sock_net(sk),
>> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> 
> This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.
> 
> I think it would be better to move UDP_INC_STATS_USER before the if
> statement. That will also simplify the code further.
> 
> 

I believe you already raised this once before, Christoph, on a previous
patch, and I replied that when IP_RECVERR is set, udp_push_pending_frames()
returns -ENOBUFS to its caller, udp_sendmsg().

And the caller takes care of it:

out:
        ip_rt_put(rt);
        if (free)
                kfree(ipc.opt);
        if (!err)
                return len;
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things).  We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;
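[Editorial note] For context, the -ENOBUFS being tested in the excerpt above comes from net_xmit_errno(), which maps the positive qdisc verdicts returned by dev_queue_xmit() onto an errno: congestion notification is treated as success, any other drop as a buffer shortage. A Python model of that mapping (verdict constants as in 2.6-era include/linux/netdevice.h; this is a sketch for illustration, not kernel code):

```python
import errno

# Qdisc verdict codes returned by dev_queue_xmit() (2.6-era values,
# listed here for illustration).
NET_XMIT_SUCCESS = 0x00
NET_XMIT_DROP = 0x01      # packet dropped by the qdisc
NET_XMIT_CN = 0x02        # packet dropped due to congestion ("soft" drop)
NET_XMIT_POLICED = 0x03   # packet dropped by a policer

def net_xmit_errno(e):
    """Model of the net_xmit_errno() macro. Callers apply it only when
    the verdict is > 0, i.e. not NET_XMIT_SUCCESS."""
    return -errno.ENOBUFS if e != NET_XMIT_CN else 0

print(net_xmit_errno(NET_XMIT_DROP))  # negative ENOBUFS
print(net_xmit_errno(NET_XMIT_CN))    # 0: congestion is not reported
```

This is why a qdisc drop reaches udp_sendmsg() as -ENOBUFS, where the SOCK_NOSPACE/ENOBUFS check quoted above converts it into a SndbufErrors increment.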




* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-02 22:26                                                         ` Eric Dumazet
  2009-09-03  1:05                                                           ` David Miller
@ 2009-09-03 17:57                                                           ` Christoph Lameter
  2009-09-03 14:12                                                             ` Eric Dumazet
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Lameter @ 2009-09-03 17:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages

On Thu, 3 Sep 2009, Eric Dumazet wrote:
> index 29ebb0d..ebaaa7f 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
>
>  send:
>  	err = ip_push_pending_frames(sk);
> +	if (err) {
> +		if (err == -ENOBUFS && !inet->recverr) {
> +			UDP_INC_STATS_USER(sock_net(sk),
> +					   UDP_MIB_SNDBUFERRORS, is_udplite);

This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.

I think it would be better to move UDP_INC_STATS_USER before the if
statement. That will also simplify the code further.




* Re: [PATCH net-next-2.6] ip: Report qdisc packet drops
  2009-09-03 14:12                                                             ` Eric Dumazet
@ 2009-09-03 18:35                                                               ` Christoph Lameter
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Lameter @ 2009-09-03 18:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, sri, dlstevens, netdev, niv, mtk.manpages

On Thu, 3 Sep 2009, Eric Dumazet wrote:

> >> --- a/net/ipv4/udp.c
> >> +++ b/net/ipv4/udp.c
> >> @@ -561,12 +561,18 @@ static int udp_push_pending_frames(struct sock *sk)
> >>
> >>  send:
> >>  	err = ip_push_pending_frames(sk);
> >> +	if (err) {
> >> +		if (err == -ENOBUFS && !inet->recverr) {
> >> +			UDP_INC_STATS_USER(sock_net(sk),
> >> +					   UDP_MIB_SNDBUFERRORS, is_udplite);
> >
> > This means that we now do not increment SNDBUFERRORS if IP_RECVERR is set.
> >
> > I think it would be better to move UDP_INC_STATS_USER before the if
> > statement. That will also simplify the code further.
> >
> >
>
> I believe you already said this once Christoph on a previous patch, and I
> replied that in case of IP_RECVERR set, udp_push_pending_frames()
> returns -ENOBUFS to its caller : udp_sendmsg()

Ahhh, I see. Would it not be better to have a single location where the
SNDBUFERRORS are accounted for? Check for -ENOBUFS before the code in
udp_sendmsg()? Hmm, the control flow is a bit complicated there...



end of thread, other threads:[~2009-09-03 14:40 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-17 20:01 UDP multicast packet loss not reported if TX ring overrun? Christoph Lameter
2009-08-17 20:40 ` Nivedita Singhvi
2009-08-17 20:46   ` Christoph Lameter
2009-08-17 21:50     ` Sridhar Samudrala
2009-08-17 22:13       ` Christoph Lameter
2009-08-17 22:43         ` Sridhar Samudrala
2009-08-17 22:52           ` Christoph Lameter
2009-08-17 22:57             ` Christoph Lameter
2009-08-18  0:12             ` Sridhar Samudrala
2009-08-18  0:25               ` Christoph Lameter
2009-08-24 17:40               ` Christoph Lameter
2009-08-24 22:02                 ` Eric Dumazet
2009-08-24 22:36                   ` Sridhar Samudrala
2009-08-25 13:48                     ` Christoph Lameter
2009-08-25 19:03                       ` David Stevens
2009-08-25 19:08                         ` David Miller
2009-08-25 19:15                         ` Christoph Lameter
2009-08-25 19:56                           ` Joe Perches
2009-08-25 22:35                           ` Sridhar Samudrala
2009-08-26 14:08                             ` Christoph Lameter
2009-08-26 14:22                               ` Eric Dumazet
2009-08-26 15:27                                 ` Christoph Lameter
2009-08-26 16:29                             ` Christoph Lameter
2009-08-26 17:50                               ` Sridhar Samudrala
2009-08-26 19:09                                 ` Christoph Lameter
2009-08-26 22:11                                   ` Sridhar Samudrala
2009-08-27 15:40                                     ` Christoph Lameter
2009-08-27 20:23                                       ` Christoph Lameter
2009-08-28 13:53                                         ` Christoph Lameter
2009-08-28 15:07                                           ` Eric Dumazet
2009-08-28 16:15                                             ` Christoph Lameter
2009-08-28 17:26                                               ` [PATCH net-next-2.6] ip: Report qdisc packet drops Eric Dumazet
2009-08-29  6:38                                                 ` David Miller
2009-08-31 12:09                                                   ` Eric Dumazet
2009-09-02  1:41                                                     ` David Miller
2009-09-02 14:43                                                       ` Eric Dumazet
2009-09-02 16:11                                                         ` Sridhar Samudrala
2009-09-02 16:20                                                           ` Eric Dumazet
2009-09-02 19:37                                                         ` Christoph Lameter
2009-09-02 16:05                                                           ` Eric Dumazet
2009-09-02 22:26                                                         ` Eric Dumazet
2009-09-03  1:05                                                           ` David Miller
2009-09-03 17:57                                                           ` Christoph Lameter
2009-09-03 14:12                                                             ` Eric Dumazet
2009-09-03 18:35                                                               ` Christoph Lameter
2009-09-02 18:22                                                       ` Christoph Lameter
2009-09-03  1:09                                                         ` David Miller
2009-08-28 19:26                                               ` UDP multicast packet loss not reported if TX ring overrun? David Miller
2009-08-28 20:00                                                 ` Christoph Lameter
2009-08-28 20:04                                                   ` David Miller
2009-08-28 19:24                                           ` David Miller
2009-08-28 19:53                                             ` Christoph Lameter
2009-08-28 20:03                                               ` David Miller
2009-08-28 20:09                                                 ` Christoph Lameter
2009-08-30  0:21                                             ` Mark Smith
2009-08-25 13:46                   ` Christoph Lameter
2009-08-25 13:48                     ` Eric Dumazet
2009-08-25 14:00                       ` Christoph Lameter
2009-08-25 15:32                         ` Eric Dumazet
2009-08-25 15:35                           ` Christoph Lameter
2009-08-25 15:58                             ` Eric Dumazet
2009-08-25 16:11                               ` Christoph Lameter
2009-08-25 16:27                                 ` Eric Dumazet
2009-08-25 16:36                                   ` Christoph Lameter
2009-08-25 16:48                                     ` Eric Dumazet
2009-08-25 17:01                                       ` Christoph Lameter
2009-08-25 17:08                                         ` Eric Dumazet
2009-08-25 17:44                                           ` Christoph Lameter
2009-08-25 17:53                                           ` Christoph Lameter
2009-08-25 18:38                                       ` Sridhar Samudrala
2009-08-24 23:14             ` Eric Dumazet
2009-08-25  6:46               ` Eric Dumazet
2009-08-25 13:45               ` Christoph Lameter
